key: cord-0682672-4yyk43xz
authors: Rout, Ranjeet Kumar; Hassan, Sk Sarif; Sheikh, Sabha; Umer, Saiyed; Sahoo, Kshira Sagar; Gandomi, Amir H.
title: Feature-extraction and analysis based on spatial distribution of amino acids for SARS-CoV-2 Protein sequences
date: 2021-11-10
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2021.105024
sha: daa58650dc32b393ddc131dd22298b7485fc80dc
doc_id: 682672
cord_uid: 4yyk43xz

BACKGROUND AND OBJECTIVE: The world is currently facing a global emergency due to COVID-19, which requires immediate strategies to strengthen healthcare facilities and prevent further deaths. To achieve effective remedies and solutions, research on different aspects, including the genomic and proteomic level characterization of SARS-CoV-2, is critical. In this work, the spatial representation/composition and distribution frequency of 20 amino acids across the primary protein sequences of SARS-CoV-2 were examined according to different parameters. METHOD: To identify the spatial distribution of amino acids over the primary protein sequences of SARS-CoV-2, the Hurst exponent and Shannon entropy were applied as parameters to capture the autocorrelation and amount of information over the spatial representations. The frequency distribution of each amino acid over the protein sequences was also evaluated. In the case of a one-dimensional sequence, the Hurst exponent (HE) was utilized due to its linear relationship with the fractal dimension (D), i.e. $D + HE = 2$, to characterize fractality. Moreover, binary Shannon entropy was considered to measure the uncertainty in a binary sequence and was then further applied to calculate amino acid conservation in the primary protein sequences. RESULTS AND CONCLUSION: Fourteen (14) SARS-CoV protein sequences were evaluated and compared with 105 SARS-CoV-2 proteins. The simulation results demonstrate the differences in the collected information about the amino acid spatial distribution in the SARS-CoV-2 and SARS-CoV proteins, enabling researchers to distinguish between the two types of CoV. The spatial arrangement of amino acids also reveals similarities and dissimilarities among the important structural proteins, E, M, N and S, which is pivotal to establish an evolutionary tree with other CoV strains.

The novel coronavirus disease (COVID-19) has rapidly become a major global emergency that has affected, and continues to affect, lives around the globe [1] [2] [3]. Presently, this disease, declared a pandemic by the WHO, is a major health concern [4] [5]. The genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), approximately 30 kb in size, is among the largest known for RNA viruses [6] [7]. Coronaviruses (CoVs) are classified into three different classes, including α-CoV, β-CoV, and γ-CoV, based on genetic and antigenic criteria [8] [9]. SARS-CoV-2 is classified as a β-CoV [10] and has received widespread research attention across the world [11] [12] [13]. Every day, new genome sequences, as well as primary protein sequences of SARS-CoV-2, are being added to databases such as the NCBI Virus database [14] [15]. As of this writing, neither antiviral drugs with proven efficacy nor vaccines for SARS-CoV-2 prevention have been reported [16] [17], while researchers have yet to attain a complete understanding of the molecular biology of SARS-CoV-2 infection [18] [19]. As a result, COVID-19 cases have increased to a global pandemic level, which urgently calls for in-depth knowledge of the virus, its infection mechanism, and other aspects such as forecasting its progression [18] [20]. Although various protein-protein interactions (PPIs) of the virus and host are known, the viral infection mechanism is not fully understood [21] [22]. Therefore, identifying interactions between the SARS-CoV-2 virus proteins and host proteins will greatly help to understand this mechanism and to develop treatments and vaccines [23]. As a first step, it is critical to gain a clear picture of the SARS-CoV-2 proteins and of the PPIs between the virus and host proteins [24]. It is known that the protein fold depends on the number, spatial arrangement, and topological connectivity of secondary structure elements (SSEs) [25], yet the spatial arrangement of SSEs is not well understood [26]. Because the geometric three-dimensional structure of a protein depends on the spatial arrangement of the SSEs [27] [28], both the spatial distribution and the presence/absence of different amino acids over a primary protein sequence of SARS-CoV-2 are significant. It is also pertinent to mention that the spatial arrangement uncovers the rules that govern the folding of polypeptide chains, and the primary sequence of a protein reveals the molecular events in evolution [29] [30].

Specifically, the alternation and spatial arrangement of amino acids over the primary sequence appear to affect the function and conformability of the protein, respectively [31] [32] [33] .

In the present study, the spatial composition of 20 amino acids across the primary proteins of SARS-CoV-2 was examined according to the Hurst exponent and Shannon entropy. A frequency analysis of the amino acids was also conducted and further compared to a similar analysis for 89 genomes of SARS-CoV-2 [34]. The usability of Shannon entropy and the Hurst exponent for the analysis of protein sequences, namely for finding correlations among the sequences, is reported in [29].

As of 24 March 2020, there are 944 known primary protein sequences of SARS-CoV-2 in the NCBI Virus Database [35]. Of these sequences, only 105 are distinct, although the sequence data were collected from a wide range of geographic locations around the world. The complete list of the 105 distinct sequences, denoted N1, N2, …, N105, with their corresponding accessions is provided at the end of the article in Appendix C. These 105 distinct protein sequences were considered in this study. Like SARS-CoV and MERS-CoV, the SARS-CoV-2 genome comprises 12 open reading frames (ORFs). Genes encoding structural proteins, such as spike (S), membrane (M), envelope (E), and nucleocapsid (N), are present in the remaining one-third of the genome, spanning from the 5′ to the 3′ terminal, along with several genes encoding non-structural proteins (NSPs) and accessory proteins scattered in between, as shown in Figure 1 [36].

The 20 amino acids are distinguished below:

• Herein, we represent the studied amino acids as A1, A2, …, A20, corresponding to A, C, F, G, H, I, L, M, N, P, Q, S, T, V, W, Y, D, E, K, and R, respectively. Each primary protein sequence was decomposed into 20 different binary sequences of 0's and 1's, according to the following rule: given a primary protein sequence of SARS-CoV-2, for every amino acid Ai, where i = 1 to 20, put 1 wherever Ai is present and put 0 elsewhere.
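The decomposition rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' MATLAB implementation; the function name `binary_sequences` and the toy sequence are hypothetical.

```python
# Order of the 20 amino acids as listed in the text.
AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"

def binary_sequences(protein):
    """Map each amino acid to a 0/1 list: 1 where it occurs, 0 elsewhere."""
    return {aa: [1 if residue == aa else 0 for residue in protein]
            for aa in AMINO_ACIDS}

# Hypothetical toy sequence, not one of the 105 studied proteins.
toy = "MKAAC"
seqs = binary_sequences(toy)
print(seqs["A"])  # [0, 0, 1, 1, 0]
print(seqs["W"])  # [0, 0, 0, 0, 0]  (W is absent from the toy sequence)
```

Each protein therefore yields 20 binary sequences of the same length as the protein, and an absent amino acid yields a sequence of zeros.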

Consequently, for every primary protein sequence Nj, j = 1, 2, …, 105, there are 20 binary sequences corresponding to the 20 different amino acids Ai, i = 1, 2, …, 20. The lengths of the 105 primary protein sequences vary widely, from 13 to 7097. One complete SARS-CoV-2 protein sequence, N99, has the smallest length of 13, and one protein sequence, N26, has the largest length of 7097. There are 6, 3, 8,

Translation of this ssRNA results in the formation of two polyproteins, namely pp1a and pp1ab, which are further cleaved to generate numerous nonstructural proteins (NSPs). The remaining ORFs encode various structural and accessory proteins that help in the assembly of the viral particle and in evading the immune response.

To characterize the amino acid spatial distribution over the primary protein sequences of SARS-CoV-2, the Hurst exponent and Shannon entropy were applied as parameters, and the amino acid density/frequency analysis was performed. Unsupervised machine learning has mostly been used for the analysis of gene and genome sequences, and also for intra-protein analysis. Markov Clustering and Affinity Propagation procedures were compared directly with the method described in [41] [42] and with the K-means clustering techniques in [43]. The K-means algorithm is better suited to inter- and intra-class analysis of protein sequences [44]. A recent application of minimum-variance cluster analysis to hierarchical agglomerative clustering performed well, as discussed in [45], and also identified groups of molecular systems, enhancing insight into peptide dynamics. The K-means clustering algorithm is used to develop homogeneous subclasses within the data, such that the data points in each cluster are as similar as possible according to a widely used distance measure, viz. the Euclidean distance. Owing to its performance and applicability, K-means is one of the most commonly used simple clustering techniques [42] [46]. In this paper, the K-means clustering algorithm was used to generate 10 clusters for the respective amino acids from the 105 SARS-CoV-2 sequences. The spatial feature extraction was implemented in MATLAB 2016a on Microsoft 2010 OS. The statistical analysis of these spatial features, presented in the upcoming sections, was carried out with STATISTICA 10.0. The following section briefly describes these methods with reference to similar works [47] [48] [49].
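To make the clustering step concrete, the following is a minimal Python sketch of K-means (Lloyd's algorithm) on scalar features, assuming one value per sequence such as an HE or SE. It is not the authors' MATLAB code; `kmeans_1d`, its parameters, and the toy values are hypothetical, and the toy example uses 2 clusters rather than the 10 mentioned in the text.

```python
import random

def kmeans_1d(values, k, iterations=100, seed=0):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # requires k <= len(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid by (1-D Euclidean) distance.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels

# Hypothetical feature values forming two well-separated groups.
values = [0.10, 0.12, 0.11, 0.90, 0.92, 0.91]
centroids, labels = kmeans_1d(values, k=2)
print(labels)
```

The cluster indices themselves are arbitrary; only the grouping of sequences is meaningful.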

The HE lies in the interval [0, 1], where HE is strictly less than 0.5 for rough anti-correlated sequences and lies in the range (0.5, 1] for positively correlated sequences. If HE = 0.5, then the sequence is random, behaving like white noise [50] [51] [52]. The HE of a binary sequence is defined as given in Eq. 1, where n is the length of the sequence:

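$$\left(\frac{n}{2}\right)^{HE} = \frac{R(n)}{S(n)}, \qquad (1)$$

where, for a binary sequence $x_1, x_2, \dots, x_n$,

$$S(n) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - m)^2}, \qquad R(n) = \max_{1 \le i \le n} X(i,n) - \min_{1 \le i \le n} X(i,n),$$

$$X(i,n) = \sum_{j=1}^{i}(x_j - m) \quad \text{and} \quad m = \frac{1}{n}\sum_{i=1}^{n} x_i.$$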

The autocorrelation of the binary representations of each amino acid over the SARS-CoV-2 protein sequences was obtained by measuring the Hurst exponent.
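As an illustration, the sketch below estimates the HE of a binary sequence in Python. It assumes the standard rescaled-range form $(n/2)^{HE} = R(n)/S(n)$, where $S$ is the standard deviation of the sequence and $R$ is the range of the cumulative deviations from the mean; this form is an assumption here, and the function name `hurst_exponent` is hypothetical. The function returns `None` when the HE is undefined, for example when the amino acid is absent and the sequence is all zeros.

```python
import math

def hurst_exponent(bits):
    """Rescaled-range estimate from (n/2)**HE = R/S; None if undefined."""
    n = len(bits)
    if n <= 2:
        return None
    mean = sum(bits) / n
    # S: standard deviation of the sequence.
    s = math.sqrt(sum((x - mean) ** 2 for x in bits) / n)
    if s == 0:  # all zeros (amino acid absent) or all ones
        return None
    # Cumulative deviations from the mean.
    cumulative, total = [], 0.0
    for x in bits:
        total += x - mean
        cumulative.append(total)
    r = max(cumulative) - min(cumulative)  # range R
    if r <= 0:
        return None
    return math.log(r / s) / math.log(n / 2)

alternating = [0, 1] * 50          # strictly alternating presence/absence
blocked = [1] * 50 + [0] * 50      # one long run of presence, then absence
print(hurst_exponent(alternating))
print(hurst_exponent(blocked))
print(hurst_exponent([0] * 20))    # None: amino acid absent
```

Under this assumed definition, the alternating sequence sits at the anti-correlated end of the scale and the blocked sequence at the persistent (trending) end.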

There are two kinds of Shannon entropy that were considered in this present study.

• Binary Shannon entropy: The entropy of a Bernoulli process is measured from the probabilities $p_1$ and $p_2$ of its two outcomes, as defined in Equation 2:

$$SE = -\sum_{i=1}^{2} p_i \log_2 p_i, \qquad (2)$$

where the frequency probabilities of 1's and 0's are $p_1 = k/l$ and $p_2 = (l-k)/l$, respectively; $l$ is the length of the binary sequence; and $k$ is the number of 1's in the binary sequence of length $l$ [53]. The binary Shannon entropy is a measure of the uncertainty in a binary sequence. When the probability $p_1 = 0$, the event is certain never to occur, so there is no uncertainty and the entropy is 0. When $p_1 = 1$, the result is certain, and thus the entropy must again be 0. When $p_1 = 0.5$, the uncertainty is at a maximum and, consequently, the SE is 1.

• Amino acid conservation Shannon entropy: Protein Post Translational Modification (PTM) is an important biological mechanism for expanding the genetic code [54] [55]. To find the conservation of amino acids in primary protein sequences, Shannon entropy is deployed. For a given protein sequence, the SE is calculated as follows:

$$SE = -\sum_{i=1}^{20} p_i \log_{20} p_i,$$

where $p_i$ represents the occurrence frequency of amino acid $A_i$ in the sequence.
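Both entropies described in the list above can be sketched in Python as follows. The function names are hypothetical. The binary entropy uses base-2 logarithms, so it lies in [0, 1]. For the conservation entropy, the logarithm base is taken to be 20 so that the value also lies in [0, 1]; this base is an assumption made here, chosen because the values reported in the text do not exceed 1. Both functions assume a non-empty input.

```python
import math
from collections import Counter

def binary_shannon_entropy(bits):
    """Base-2 entropy of a 0/1 sequence; 0 when only one symbol occurs."""
    l = len(bits)
    k = sum(bits)  # number of 1's
    entropy = 0.0
    for p in (k / l, (l - k) / l):
        if p > 0:  # the term p*log(p) is taken as 0 when p == 0
            entropy -= p * math.log2(p)
    return entropy

def conservation_entropy(protein, alphabet_size=20):
    """Entropy of amino acid frequencies; log base 20 is an assumption."""
    counts = Counter(protein)
    total = len(protein)
    return -sum((c / total) * math.log(c / total, alphabet_size)
                for c in counts.values())

print(binary_shannon_entropy([0, 1, 0, 1]))        # 1.0 (maximum uncertainty)
print(binary_shannon_entropy([0, 0, 0]))           # 0.0 (amino acid absent)
print(binary_shannon_entropy([1, 0]) ==
      binary_shannon_entropy([1, 1, 0, 0]))        # True
```

The last line shows that two sequences of different lengths have the same binary SE whenever the proportion of 1's is the same.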

Over the primary protein sequences of SARS-CoV-2, we aimed to explore the amino acid frequency distributions and corresponding statistical descriptions [11] [56]. The density of the amino acids over a primary protein sequence can also be found using the following formula:

$$\rho(A_i, N_j) = \frac{f(A_i, N_j)}{l(N_j)},$$

where $A_i$ is an amino acid present in the primary protein sequence $N_j$; $l(N_j)$ is the length of sequence $N_j$; and $f(A_i, N_j)$ is the frequency of amino acid $A_i$ in sequence $N_j$. This amino acid density would clarify the richness of essential amino acids in contrast to others.
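A minimal Python sketch of the density computation follows, assuming the density is the count of an amino acid divided by the sequence length (an assumption here; a percentage would simply multiply by 100). The function name `amino_acid_density` and the toy sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"

def amino_acid_density(protein):
    """Return {amino_acid: count / length} for all 20 amino acids."""
    counts = Counter(protein)
    length = len(protein)
    return {aa: counts[aa] / length for aa in AMINO_ACIDS}

# Hypothetical toy sequence, not one of the studied proteins.
density = amino_acid_density("MKAAC")
print(density["A"])  # 0.4
print(density["W"])  # 0.0
```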

J o u r n a l P r e -p r o o f

Herein, the positive/negative trend of the spatial distribution of the 20 amino acids over the SARS-CoV-2 protein sequences based on the Hurst exponent and Shannon entropy is reported. As mentioned earlier, the Hurst exponent implies the fractality (organized non-linearity) of the spatial representations. Also, the amount of uncertainty in the presence/absence of amino acids over the protein sequences was determined through Shannon entropy measurements, which provide conservation information about the amino acids. Based on the frequency distributions of all amino acids over the SARS-CoV-2 protein sequences, 14 SARS-CoV protein sequences were subsequently compared with 105 SARS-CoV-2 proteins.

For is not present in the protein sequences N3, N80, N97, N98 and N99 of SARS-CoV-2. The spatial organization of amino acid H is random (neither trending nor negatively autocorrelated) in the protein sequences N5, N15, N88, N89, N90, N91, N92, N93, N94, and N95, which belong to cluster 2, as shown in Table 6 (Appendix A). Cluster 2 contains ten sequences (N68, N88, N89, N90, N91, N92, N93, N94, N95, and N99) with no HE (*), which indicates that the corresponding binary sequences are completely free from amino acid A2 (C). Protein sequences N68 and N81 lack the conditionally essential amino acid A4 (G), as can be seen in Table 5 (Appendix A), while N99 is the only sequence that does not have the essential amino acid A6 (I). The spatial distribution of amino acid A6 (I) over the protein sequence N102 is truly random since the HE is 0.509, whereas the other 104 sequences are trending with HEs greater than 0.5. The spatial arrangements of amino acid A7 (L) over these proteins are neither random nor trending, as the HE is greater than 0.5 but less than 0.6.

The HE of the binary representation of the amino acids forming eight clusters ranges from to with a standard deviation between 0.04 and 0.111. The binary representation of the spatial organization of nonessential amino acid A12

The protein sequences of different lengths, ranging from 13 to 419, are provided below. Table 4 lists the amino acid(s) that are not present in the sequences. The protein sequence N99, of length 13, does not contain some essential, conditionally essential, and nonessential amino acids, namely C, H, M, P, T, W, Y, E, K and R. The largest of these sequences, N88, N89, N90, N91, N92, N93, N94, and N95, of length 419, do not contain amino acid C. It is noted that amino acid M is present in all the protein sequences except N99, which has the smallest length of 13. It has also been observed that the essential amino acids L, M, F and V, as well as the non-essential amino acids A, D, N and S, are present in all the protein sequences of SARS-CoV-2. In addition, the six conditionally essential amino acids were not found to be essential for all the proteins of SARS-CoV-2. Proteins that have a length greater than 419 contain all 20 amino acids. It is reported that the presence of amino acids I, G and V is of primordial importance; in this study, it was also found that N99 does not contain I and that amino acid G is not present in sequences N68 and N81.
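A presence/absence check of this kind can be sketched in Python as follows. The function name `absent_amino_acids` and the toy sequence are hypothetical; the actual sequences would be those listed in the appendix.

```python
AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"

def absent_amino_acids(protein):
    """List the amino acids (in the order used in the text) missing from a sequence."""
    present = set(protein)
    return [aa for aa in AMINO_ACIDS if aa not in present]

# Hypothetical toy sequence: only M, K, A and C occur.
missing = absent_amino_acids("MKAAC")
print(missing)
```

Short sequences naturally miss more amino acids, which is consistent with the observation that the shortest sequence lacks many of them.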

It is also noted that amino acid H is randomly spatially distributed over protein sequences N5, N15, N88, N89, N90, N91, N92, N93, N94 and N95, as observed in the previous subsections. The essential sulfur-containing amino acid M is randomly arranged over proteins N80 and N102. Also, amino acid L is randomly distributed over the protein sequence N102, while only amino acid K is randomly spread over N104. In sequences N98 and N102, amino acid R is distributed with a negative trend (HE < 0.5). Also, the amino acids K, Y, S, Q, N, and F are negatively trending over the protein sequences N103, N80, N7, N100, N2, and N5, respectively. Therefore, amino acids C, G, P, T, W, and E are distributed over all 105 proteins with positive autocorrelation (positively trending).

Here, we explore the correlation (of trending behaviors) of the amino acid distribution over 105 proteins of SARS-CoV-2. The correlation matrix of ten amino acids, A, C, F, G, H, I, L, M, N and P, versus another ten amino acids, Q, S, T, V, W, Y, D, E, K and R, is presented below. Based on the HEs, the spatial distribution of amino acid A is positively correlated with the spatial distributions of amino acids Q, T, V, W, and Y, as shown in Table 5. The correlation based on HEs of the spatial distribution is also demonstrated in the graphs in Fig. 4. It is worth mentioning that the correlation matrix (presented in Table 5) also displays the negative correlations of the spatial distributions over the proteins. An example of the correlation (correlation coefficient r = 0.443) between the spatial distribution (autocorrelation) of amino acid M and the spatial distribution of amino acid Y is given in Fig. 3.
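A correlation coefficient between two amino acids can be computed from their per-sequence HE values, for example with the Pearson coefficient sketched below in Python. The text does not state which coefficient was used in STATISTICA, so Pearson is an assumption here. The function name `pearson`, the choice to skip sequences where either HE is undefined (`None`), and the toy values are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson r over paired values; pairs containing None are skipped."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:  # a constant series has no defined r
        return None
    return cov / math.sqrt(var_x * var_y)

# Hypothetical HE values for two amino acids over five sequences.
he_first = [0.61, 0.55, None, 0.70, 0.58]
he_second = [0.60, 0.52, 0.66, 0.73, 0.57]
print(pearson(he_first, he_second))
```

Applying the function to every pair of amino acids gives a correlation matrix of the kind tabulated in the text.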


The following subsection discusses the amount of uncertainty/certainty of the presence of amino acids over the protein sequences.


For amino acids , the Shannon entropy (SE) was determined for the 105 binary sequences. The SE of the binary representation of the amino acids forming five clusters ranges from to with a standard deviation between 0.0448 and 0.0919. The SE of the spatial distribution of amino acid in protein sequence N68 was determined to be 0.121, which is the lowest amount of uncertainty compared to the SE of other amino acids. In clusters 4 and 1, almost all the protein sequences had an SE less than 0.5, indicating the definite presence and absence of a particular amino acid over the protein sequences. The amount of uncertainty is high for protein sequences N3 and N99, with lengths of 198 and 13, respectively. Amino acids and are absent from protein sequence N99, with an SE less than 0.5, as shown in Tables 35 and 36, respectively. Amino acid A14 (V) is present over all 105 proteins, and hence none of the binary representations has SE = 0. For amino acid V, the SE of N74 and N77 is 0.391, which implies that the presence of this amino acid over the proteins has good certainty, while N96 and N97 have the maximum uncertainty of SE = 0.665. Cluster 1 contains five protein sequences in which amino acid is absent, and hence SE = 0. Also, SE = 0 for the binary spatial representations of N99 and N103 for amino acid , N80 and N99 (belonging to cluster 2) for amino acid , N80, N81 and N99 for amino acid , and N81 and N99 for amino acid , due to the absence of these amino acids. It is pertinent to note that amino acids and are present over all 105 proteins with certainty. Most of the proteins in the largest cluster 2, as well as in the other clusters, contain amino acid that is spatially distributed with certainty.

The SE of the binary representation of the amino acids forming six clusters ranges from to with a standard deviation between 0.0749 and 0.852. Amino acid A4 (G) is absent from the primary protein sequences N68 and N81, and consequently SE = 0, implying no uncertainty. Similarly, SE = 0 for the binary spatial representations of protein sequence N99 for amino acid , sequences N81, N99 and N103 for amino acid A10 (P), and sequences N96 and N97 for amino acid A11 (Q). Amino acid is spread spatially with certainty over the proteins N2 (length of 138) and N89, N90, N91, N92, N93, N94 and N95 (lengths of 419) in cluster 3.

Clusters 1 and 5 for amino acid and cluster 1 for amino acids and contain the majority of the protein sequences, where the presence of these amino acids is spread over the proteins with almost certainty.

Comparatively, clusters 2 and 6 contain five protein sequences, where the absence of the amino acid is spread with almost certainty. Cluster 3 contains one protein sequence N80 where the spatial distribution has SE = 0.562, which indicates that the absence of amino acid over the protein is without uncertainty.

The SE of the binary representation of the amino acids forming seven clusters each ranges from to with a standard deviation between 0.0667 and 0.0765. It was found that SE = 0 for the spatial distribution of amino acid A2 (C) in the protein sequences N68, N88, N89, N90, N91, N92, N93, N94, N95 and N99, which indicates that the amount of uncertainty is zero. In other words, both the absolute absence of amino acid C over these proteins and the spatial presence of amino acid C over the protein sequences of other clusters have low uncertainty (high certainty). The SE is greater than 0.5 for the binary representations of amino acid over the proteins N81 and N99, and consequently, the amount of uncertainty is lowering. In other clusters containing the other protein sequences, the spatial presence of amino acid over the protein sequences has low uncertainty (high certainty).

The SE of the binary representation of the amino acids forming eight clusters ranges from to with a standard deviation between 0.0459 and 0.0749. Because amino acid is absent from proteins N3, N80, N97, N98, and N99 and amino acid is absent from N99 (smallest length of 13), SE = 0 for these amino acids, implying there is no uncertainty. In addition, SE = 0.078 for the spatial representation of the presence and absence of amino acid over the proteins N88, N89, N90, N91, N92, N94 and N95 (lengths of 419, belonging to cluster 4); hence, the spatial distribution is more certain/orderly. All the clusters except cluster 6 contain only protein sequences over which amino acid is spatially distributed with certainty, whereas cluster 6 contains two sequences, N81 (length of 43) and N68 (length of 61), where the absence of the amino acid dominates the presence with certainty.

It is pertinent to mention that SE = 0 for the binary representation of any amino acid Ai that is absent from a protein sequence Nj, as has been demonstrated in this study. It was also observed that the maximum SE was obtained for the spatial distribution of amino acids over lengthy sequences, such as N99, N80, etc. Interestingly, for some given amino acid Ai, the same SE was obtained for the spatial distributions over several protein sequences Nj, irrespective of their lengths, for many values of j. This essentially suggests that the probability of the presence of amino acid Ai over these protein sequences is the same.

Further, we explored the correlation in the amount of uncertainty between the spatial distributions of the 20 amino acids over the proteins of SARS-CoV-2. Table 6 presents the correlation matrix of ten amino acids (A, C, F, G, H, I, L, M, N and P) versus another ten amino acids (Q, S, T, V, W, Y, D, E, K and R).

Table 6. Correlation matrix of SEs of present amino acids over the protein sequences.

Based on the SEs, the spatial distribution of amino acid A was found to be positively correlated with the distributions of amino acids Q, S, D, K and R, as shown in Table 6. Likewise, the spatial distribution of amino acid C is positively correlated with those of amino acids T, V, Y and E. Similarly, the positive correlations between the spatial distributions of amino acids F, G, H, I, L, M, N and P and the other amino acids are established in the correlation matrix in Table 6, which also shows negative correlations.

Figure 5. Plots of SEs and histograms of all the binary sequences; each pair of panels, from (a) and (b) through (s) and (t), corresponds to one amino acid.

The correlation of the spatial distribution of amino acid R with the spatial distribution of amino acid P is given in Fig. 6.

were formed, and the respective SE plots and histograms for the 105 protein sequences are provided in Table 7. It can be observed that the Shannon entropy of amino acid conservation along the protein sequences of SARS-CoV-2 ranges from 0.7 to 0.982. Since the SE is close to 1, meaning that the uncertainty is near its maximum, the amino acids are almost uniformly distributed over the protein sequences. More than 50% of the protein sequences (54), belonging to cluster 2 of SARS-CoV-2, have SE = , which further implies that the amino acids are almost uniformly spread over the sequences.
Subsequently, the frequency analysis of the amino acids over the proteins is given in the following subsection.

In this section, the frequencies of the amino acids in the 105 SARS-CoV-2 protein sequences are statistically compared, as shown in Figs. 9 and 10.

A correlation matrix between the frequency distributions of amino acids over the 105 SARS-CoV-2 protein sequences is provided in Table 8, and the respective correlation graphs are illustrated in Fig. 11. It can be observed that the correlation coefficients are very close to 1, which indicates significant correlations between the frequencies of each amino acid over the proteins. For instance, the correlation coefficient between the frequency distributions of amino acids A (Aliphatic) and K (Basic) is 1, as illustrated in Fig. 12, indicating a strong correlation.

Overall, it is observed that protein sequences of the same length have very similar frequency distributions of the twenty amino acids.

proteins (S1, S2, S11) with their accessions are given here in Table 9. It is noted that the protein with the accession ACU31032 (S14) is a spike protein of length 1241, as mentioned in the NCBI database. The spike protein (S-protein) is a large type I transmembrane protein of length not exceeding 1400 amino acids. The spike protein has an important function in the case of SARS-CoV [58] [59]. Among all the proteins of SARS-CoV, the spike protein is the main antigenic component that is responsible for inducing host immune responses, neutralizing antibodies, and/or protective immunity against virus infection [60]. We therefore examine here the spatial representations of the amino acids over the spike protein and the other 13 proteins, as mentioned in Table 10. The HE, SE, and frequency distributions are given in the following and compared with those of the SARS-CoV-2 proteins.

protein S12 which is a hypothetical protein. It is noted that the HE is left blank for the cases where the binary spatial representation of an amino acid is entirely a sequence of zeros, i.e., the amino acid is absent from the protein.

Table 11 below gives the correlation coefficients of the HEs of the spatial representations of the amino acids over the 14 SARS-CoV proteins. It is noted that the SE is zero whenever the amino acid is absent from a protein. The spatial distributions of amino acids over the proteins of SARS-CoV are all without much uncertainty, except for three cases where the SEs are greater than 0.5 and the absence of the amino acid dominates in terms of certainty. The correlation coefficients of the SEs of the spatial distributions of the amino acids over the 14 SARS-CoV proteins are given in Table 12. It is observed that the correlations among the SEs of the spatial distributions of the amino acids over the proteins are not notably high, as tabulated in Table 12. The highest positive correlation, between the SEs of the spatial distributions of amino acids C and Y, is 0.572.

Previous reports state that the genomes of SARS-CoV and SARS-CoV-2 exhibit similar protein sequences. However, we found that the spatial arrangement of amino acids over the studied protein sequences is clearly different, contributing to differences between the proteins. This study reveals the hidden spatial arrangement of the amino acids of SARS-CoV-2 and SARS-CoV. Specifically, the spatial arrangements of amino acids over the

Authors' Contribution:

SH had initiated the problem for the study, and RKR and SH executed the results from the data. SH, RKR, SS, SU, KSS, and AHG analyzed and interpreted the results. SH was a major contributor in writing the manuscript.

All authors read and approved the final manuscript.

Appendix B: Hurst exponents of the 105 SARS-CoV-2 binary sequences for amino acids A3 (F), A4 (G), A5 (H), A6 (I), A7 (L), A8 (M), A9 (N), A10 (P), and A11 (Q); panels (c) and (d) for A12 (S), (e) and (f) for A13 (T), (g) and (h) for A14 (V), (i) and (j) for A15 (W), (k) and (l) for A16 (Y), (m) and (n) for A17 (D), (o) and (p) for A18 (E), (q) and (r) for A19 (K), and (s) and (t) for A20 (R).

QIJ96471 38 15 P158 QIJ96501 38 16 P161 QIJ96521 38 17 P184 QII87803 38 18 P192 QII87791 38 19 P201 QII87827 38 20 P212 QII87815 38 21 P223 QII57176 38 22 P233 QII57276 38 23 P243 QII57346 38 24 P252 QII57226 38 25 P264 QII57286 38 26 P269 QII57236 38 27 P285 QII57186 38 28 P308 QII57336 38 29 P311 QII57216 38 30 P325 QII57206 38 31 P333 QII57326 38 32 P342 QII57306 38 33 P353 QII57316 38 34 P362 QII57296 38 35 P364 QII57196 38 36 P376 QIA98562 38 37 P387 QII57246 38 38 P394 QII57256 38 39 P413 QII57266 38 P741 QHU36836 75 261 P753 QHU36856 75 262 P761 QHU36866 75 263 P771 QHU36826 75 264 P778 QHU36846 75 265 P795 QHU79206 75 266 P802 QHR84451 75 267 P809 QHR63252 75 268 P819 QHR63262 75 269 P829 QHR63272 75 270 P839 QHR63282 75 271 P849 QHR63292 75 272 P856 QHO62113 75 273 P865 QHQ82466 75 274 P880 QHQ71965 75 275 P890 QHQ71975 75 276 P891 QHO62108 75 277 P900 QHO62879 75 278 P912 QHN73812 75 279 P921 QHN73797 75 280 P927 QHO60596 75 281 P939 QHD43418 75 282 P16 YP_009725303 83 283 P12 YP_009725305 113 284 P23 YP_009724395 121 285 P24 YP_009724396 121 286 P31 QIK50443 121 287 P35 QIK50444 121 288 P48 QIK50453 121 289 P49 QIK50454 121 290 P54 QIK50422 121 291 P57 QIK50423 121 292 P68 QIK50432 121 293 P69 QIK50434 121 294 P73 QIK02959 121 295 P74 QIK02960 121 296 P85 QIK02969 121 297 P86 QIK02970 121 298 P96 QIK02949 121 299 P97 QIK02950 121 300 P107 QIJ96479 121 301 P109 QIJ96478 121 302 P116 QIJ96489 121 303 P118 QIJ96488 121

Clinical features of patients infected with 2019 novel coronavirus in

A Novel Coronavirus from Patients with Pneumonia in China

Consideration on the strategies during control

On spatial molecular arrangements of SARS-CoV2 genomes of Indian patients

Spatial Distribution of Amino Acids of the SARS-CoV2 Proteins

Another Decade, Another Coronavirus

A novel coronavirus outbreak of global health concern

Genomic variance of the 2019-nCoV coronavirus

Zoonotic origins of human coronaviruses

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

A Genomic Perspective on the Origin and Emergence of SARS-CoV-2

The proximal origin of SARS-CoV-2

On the origin and continuing evolution of SARS-CoV-2

Database resources of the National Center for Biotechnology Information

Virus Variation Resource-improved response to emergent viral outbreaks

Research and Development on Therapeutic Agents and Vaccines for COVID-19 and Related Human Coronavirus Diseases

COVID-19, an emerging coronavirus infection: advances and prospects in designing and developing vaccines, immunotherapeutics, and therapeutics

Explaining machine learning based diagnosis of COVID-19 from routine blood tests with decision trees and criteria graphs

Overlapping and discrete aspects of the pathology and pathogenesis of the emerging human pathogenic coronaviruses SARS-CoV, MERS-CoV, and 2019-nCoV

A multiple combined method for rebalancing medical data with class imbalances

Protein-protein interactions of human viruses

Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method

Structural genomics of SARS-COV-2 indicates evolutionary conserved functional regions of viral proteins

A SARS-CoV-2-human protein-protein interaction map reveals drug targets and potential drug-repurposing

Protein structure comparison: implications for the nature of "fold space

Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions

Classification of Mer Proteins in a Quantitative Manner

A geometric algorithm to find small but highly similar 3D substructures in proteins

UmerSaiyed, Intelligent Classification and Analysis of Essential Genes Using Quantitative Methods

New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage

Hydrophobie distribution and spatial arrangement of amino acid residues in membrane proteins

Intercalating amino acid guests into montmorillonite host

EightyDVec: a method for protein sequence

Markov clustering versus affinity propagation for the partitioning of protein interaction graphs, BMC Bioinforma

Unsupervised feature selection using an improved version of Differential Evolution

The global k-means clustering algorithm

An automatic tool to analyze and cluster macromolecular conformations based on self-organizing maps

Clustering algorithms applied on analysis of protein molecular dynamics

Validating clustering of molecular dynamics simulations using polymer models, BMC Bioinforma

The variations of human miRNAs and Ising like base pairing models

Ranking and clustering of Drosophila olfactory receptors using mathematical morphology

Analysis of Purines and Pyrimidines distribution over miRNAs of Human, Gorilla, Chimpanzee, Mouse and Rat

Fractal Analysis of Time Series and Distribution Properties of Hurst Exponent

Estimation of Hurst exponent revisited

Introducing fractal dimension algorithms to calculate the Hurst exponent of financial time series

Divergence Measures Based on the Shannon Entropy

The Shannon information entropy of protein sequences

Shannon information entropy in the canonical genetic code

The SARS-CoV S glycoprotein: Expression and functional characterization

Characterization of severe acute respiratory syndrome-associated coronavirus (SARS-CoV) spike glycoprotein-mediated viral entry

The spike protein of SARS-CoV -A target for vaccine and therapeutic development

Receptor-binding domain of SARS-CoV spike protein induces highly potent neutralizing antibodies: Implication for developing subunit vaccine

Treatment of SARS with human interferons

No.  Protein  Accession     Length
              QIK50421      61
103  P60      QIK50431      61
104  P78      QIK02958      61
105  P87      QIK02968      61
106  P98      QIK02948      61
107  P104     QIJ96477      61
108  P111     QIJ96487      61
109  P127     QIJ96507      61
110  P138     QIJ96527      61
111  P149     QIJ96467      61
112  P160     QIJ96497      61
113  P165     QIJ96517      61
114  P191     QII87786      61
115  P195     QII87798      61
116  P203     QII87822      61
117  P217     QII87810      61
118  P221     QII57172      61
119  P231     QII57272      61
120  P240     QII57342      61
121  P251     QII57222      61
122  P262     QII57282      61
123  P276     QII57232      61
124  P281     QII57182      61
125  P291     QII57302      61
126  P307     QII57332      61
127  P317     QII57212      61
128  P323     QII57202      61
129  P329     QII57322      61
130  P347     QII57312      61
131  P358     QII57292      61
132  P366     QII57192      61
133  P377     QIA98558      61
134  P389     QII57242      61
135  P391     QII57252      61
136  P405     QII57262      61
137  P417     QHS34550      61
138  P430     QIA98587      61
139  P435     QIH55225      61
140  P447     QIH45027      61
141  P457     QIH45037      61
142  P462     QIH45047      61
143  P467     QIH45057      61
144  P487     QIG55998      61
145  P498     QIE07475      61
146  P510     QIE07465      61
147  P521     QIE07455      61
148  P531     QIE07485      61
149  P542     QID98798      61
150  P552     QID21052      61
151  P559     QID21072      61
152  P566     QID21062      61
153  P578     QIC53208      61
154  P593     QIC53217      61
155  P596     QIB84677      61
156  P607     QIA98600      61
157  P616     QIA98610      61
158  P636     QIA20048      61
159  P645     QHZ87586      61
160  P653     QHZ87596      61
161  P661     QHZ00393      61
162  P665     QHZ00362      61
163  P681     QHZ00403      61
164  P690     QHZ00383      61
165  P696     QHW06053      61
166  P702     QHW06063      61
167  P722     QHW06043      61
168  P732     QHU79198      61
169  P736     QHU36838      61
170  P752     QHU36858      61
171  P758     QHU36868      61
172  P769     QHU36828      61
173  P775     QHU36848      61
174  P791     QHU79208      61
175  P798     QHR84453      61
176  P811     QHR63254      61
177  P821     QHR63264      61
178  P831     QHR63274      61
179  P841     QHR63284      61
180  P851     QHR63294      61
181  P867     QHQ82468      61
182  P876     QHQ71967      61
183  P882     QHQ71977      61
184  P902     QHO62881      61
185  P906     QHN73814      61
186  P918     QHN73799      61
187  P928     QHO60598      61
188  P941     QHD43420      61
189  P25      YP_009724392  75
190  P37      QIK50440      75
191  P43      QIK50450      75
192  P53      QIK50419      75
193  P67      QIK50429      75

              QIG56001      419
701  P500     QIE07478      419
702  P508     QIE07468      419
703  P519     QIE07458      419
704  P526     QIE07488      419
705  P536     QID98801      419
706  P550     QID21055      419
707  P562     QID21075      419
708  P564     QID21065      419
709  P582     QIC53211      419
710  P584     QIC53221      419
711  P600     QIB84680      419
712  P609     QIA98602      419
713  P618     QIA98613      419
714  P638     QIA20052      419
715  P643     QHZ87589      419
716  P651     QHZ87599      419
717  P659     QHZ00396      419
718  P668     QHZ00365      419
719  P678     QHZ00386      419
720  P679     QHZ00406      419
721  P707     QHW06066      419
722  P708     QHW06056      419
723  P719     QHW06046      419
724  P729     QHU79201      419
725  P737     QHU36841      419
726  P751     QHU36861      419
727  P757     QHU36871      419
728  P768     QHU36831      419
729  P776     QHU36851      419
730  P788     QHU79211      419
731  P796     QHR84456      419
732  P815     QHR63258      419
733  P825     QHR63268      419
734  P835     QHR63278      419
735  P845     QHR63288      419
736  P855     QHR63298      419
737  P858     QHO62115      419
738  P864     QHQ82471      419
739  P873     QHQ71970      419
740  P881     QHQ71980      419
741  P895     QHO62110      419
742  P899     QHO62884      419
743  P911     QHN73817
              QHN73802      419
745  P926     QHO60601      419
746  P944     QHD43423      419
747  P4       YP_009725300  500
748  P7       YP_009725309  527
749  P10      YP_009725308  601
750  P18      YP_009725298  638
751  P9       YP_009725307  932
752  P418     QHS34546      1272
753  P28      YP_009724390  1273
754  P38      QIK50438      1273
755  P44      QIK50448      1273
756  P58      QIK50417      1273
757  P64      QIK50427      1273
758  P71      QIK02954      1273
759  P81      QIK02964      1273
760  P95      QIK02944      1273
761  P102     QIJ96473      1273
762  P119     QIJ96483      1273
763  P126     QIJ96503      1273
764  P136     QIJ96523      1273
765  P150     QIJ96463      1273
766  P156     QIJ96493      1273
767  P170     QIJ96513      1273
768  P178     QII87794      1273
769  P181     QII87806      1273
770  P182     QII87782      1273
771  P198     QII87818      1273
772  P229     QII57268      1273
773  P246     QII57168      1273
774  P254     QII57218      1273
775  P260     QII57278      1273
776  P268     QII57338      1273
777  P274     QII57228      1273
778  P282     QII57178      1273
779  P294     QII57161      1273
780  P304     QII57328      1273
781  P314     QII57208      1273
782  P321     QII57198      1273
783  P337     QII57318      1273
784  P339     QII57298      1273
785  P345     QII57308      1273
786  P363     QII57288      1273

The spatial representation and distribution frequency of amino acids were examined. The Hurst exponent and Shannon entropy were applied to capture the autocorrelation and amount of information. The simulation results make it possible to distinguish between the two types of CoV. The spatial arrangement of amino acids reveals important information about the structural proteins.

o This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.
o The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.
o The following authors have affiliations with organizations with direct or indirect financial interest in the subject matter discussed in the manuscript: Author's name