key: cord-0189630-q716z9wk authors: Ali, Sarwan; Bello, Babatunde; Chourasia, Prakash; Punathil, Ria Thazhe; Zhou, Yijing; Patterson, Murray title: PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences date: 2022-01-06 journal: nan DOI: nan sha: cee7e881c1d7969c7b4b89d941b399a77a88c789 doc_id: 189630 cord_uid: q716z9wk COVID-19 pandemic, is still unknown and is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona-) viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among avians, bats, camels, swines, humans and weasels, to name a few. We propose a feature embedding based on the well-known position-weight matrix (PWM), which we call PWM2Vec, and use to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications such as determining protein function, or identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs in the context of host classification from viral sequences to generate a fixed-length feature vector representation. The results on the real world data show that in using PWM2Vec, we are able to perform comparably well as compared to baseline models. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. The coronavirus disease 2019 pandemic is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The pandemic has put millions of people at risk in numerous countries worldwide and made an unprecedented public health crisis [1] . Although the origin of COVID-19 (SARS-CoV-2) in humans is still unknown, there are many theories that it could have been transferred to humans from bats [2] . Likewise, several related coronaviruses (CoVs), such as SARS (SARS-CoV), Figure 1 . The genome of a coronavirus ranges from 26-32kb in length [5] , each of them coding for two non-structural and four structural proteins. The non-structural proteins are coded by open reading frame (ORF) 1a and 1ab, where ORF1ab contains the RNA-dependent RNA polymerase gene (RdRp). The structural genes include spike (S), envelope (E), membrane (M) and nucleocapsid (N). The S gene region encodes the spike protein, which is responsible for attaching the virus to receptors on the host cell membrane. methods have been developed and applied for constructing phylogenetic trees, including the most similar supertree algorithm (MSSA) method [25] , MRP method [26] , and approximate maximum likelihood (ML) supertree method [27, 28] , similar methods upon which the state-of-the art Nextstrain [22] and IQTREE2 [23] have been developed. However, these methods of building a phylogenetic tree for CoVs require a high computational complexity, and the vast volume of sequences data size can cause a scalability issue for the phylogenetic-based approaches [29] . For example, Nextstrain [22] is able to construct trees on thousands of sequences, while IQTREE2 [23] is able to scale to tens of thousands of sequences. There are currently millions of sequences available on GISAID alone -clearly viable alternatives are necessary. Here we study machine learning clustering and classification as an alternative to phylogenetic tree building. Some effort have been made to study the coronavirus host data [30] by using one-hot embedding (OHE) approach to get fixed length feature embedding for the spike sequence. Although OHE gives better predictive performance, one of its drawbacks is the high dimensionality of the feature vectors. Also, the columns of the OHE based vector have a linear relationship with each other, which means that one variable can be easily predicted using the other variables. This behavior can cause problems of parallelism and multicollinearity (when multiple features are correlated with each other) in high dimensions. Authors in [31, 32] using the coronavirus spike sequences to classify different variants of COVID-19 using k-mers based frequency vectors. Researchers have performed clustering on the COVID-19 spike sequences using the same k-mers based frequency vector generation approach [33, 34] . Although their approaches are effective in terms of predictive performance, the dimensionality of feature vector representation is still high, which can creates the very well known problem in machine learning the problem of the curse of dimensionality. Moreover, since for each k-mer, it is required to find the appropriate bin dedicated for that specific k-mer, this 'Bin Matching' could be expensive in terms of computational cost. Another possible solution, which is what we propose here, is the use of the position weight matrix (PWM), sometimes also called as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM) [35] . It is a good representation for motifs in biological sequences. A motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. The PWM is a application of entropy and relative entropy towards identifying transcription factor binding sites (TFBSs), for example. A PWM contains information about frequency of nucleotide for each position in form of weights, these log-odds or log-probability weights are used for computing the binding affinity score. The PWM can be used to distinguish the binding sites from the sequence, a well known method for de novo motif sequence finding. If we do not know about a possible motif in given sequence, there are methods such as expectation maximization (EM) and Gibbs sampling which uses the PWM. Inspired by this application, we compute an absolute score from the PWM while scanning the sequence for "motifs" (here k-mers) of our sequence using a sliding window (of size k) to scan the sequence and computing absolute score as shown in Figure 4 . We can find relevant information of the motifs based on score calculated from the PWM. The higher the score, the more relevant the k-mer is. In this paper, we propose an approach, called PWM2Vec, which is also basic implementation of position weight matrix (PWM) to generate feature vector representation from coronavirus spike sequences. Given a spike sequence, we first extract k-mers. From the k-mers, we generate the PWM (see Figure 3 ). After that, we assign a score to each k-mer using PWM to design a feature embedding and apply machine learning methods like classification and clustering on this feature embedding. While this is inspired by methods for finding motifs (e.g., TF-promoter binding sites), here our goal to obtain a numerical representation of these k-mers generated from each sequence. Our contributions in this paper are as follow: 1. We propose an approach to generate fixed-length numerical representation of spike sequences using PWMs. The generated feature vectors could be used as input for any machine learning algorithm for tasks such as classification and clustering. 2. Our proposed feature embedding approach contains more compact information and give results better than the baselines in terms of classification and clustering. 3. Our feature vector contains fewer dimensions as compared to k-mers and one-hot encoding based feature vectors (≈ 24 times fewer dimensions than one-hot encoding and ≈ 4 times fewer dimensions than k-mers based embedding), which improves the runtime for classification and clustering algorithms. 4. We performed statistical analysis on the data and show importance of different positions of amino acids that play key role towards classification of different hosts, as well as to validate the compactness of our proposed embedding from an orthogonal point of view. The rest of the paper is organised as follows: Section 2 contains the previous work related to the sequence classification in general and coronavirus spike sequence classification in particular. Section 3 contains the detail about our proposed alignment free method for spike sequence classification. Section 4 contains the experimental setup and dataset collection and statistics detail. The results for our proposed method are given in Section 5. Finally, we conclude our paper in Section 6. Several machine learning approaches based on k-mers have been proposed in the literature for classification and clustering tasks [31, 33, 36, 37] . More specifically, there are a lot of classical algorithms for sequence classification [38, 39] . Although these methods are proven to be useful in respective studies, it is not clear if they can be used in context of coronavirus data. Also, another major problem with all those methods is high computational complexity of the algorithms (because of high dimensional representation of data), which can result in the higher runtime of the underlying classification algorithms. Postition weight matrix (PMW) based approaches have been successfully applied for diverse sequence analysis, motifs predictions and identification studies. Several popular software applications or web servers have been build based on the implementation of PWM, e.g., the PWMscan software package [40] and PSI-BLAST [41] . Also, a many other advanced algorithms have been implemented to optimize PWM technique: examples include MEME (multiple EM for motif elicitation) [42] implemented based expectation maximum (EM) and Gibbs Sampler [43] for de novo motifs discovery that uses Gibbs sampling algorithms [44] , [45] . The MEME EM algorithm basically finds an initial motif and repeatedly use EM steps improve motifs until the PWM values do not improve further [42, 46] . Also, the BaMM (Bayesian Markov model) algorithm build based on the Markov to model correlations among nucleotides at other positions -since the PWM cannot because the method assumes probabilities at different sites are independent from each others [45] . The PWM method continues to be applied and extended. Log2PWMs is simple implementation of PWM extended to enable conversion or reconstruction of PWM representation from sequence logo [47] . The PWM is also used for the binding specificity of a transcription factor (TF) [48] . It can be used to scan a sequence for the presence of DNA words, which are comparatively more similar to the PWM than to the background [49, 50] . Authors in [51] evaluate the Bayesian network and support vector machine (SVM) algorithm on four distinct TF binding sites based data sets and analyze their performance using PWM. Authors in [52] develop a tree-based PWM algorithm to accurately simulate the interaction between TF and its binding sites. A new di-nucleotide PWM approach is proposed in [53] , which outperforms the conventional mono-nucleotide PWM method. Moreover, the research done in [54] proposes an improved position weight matrix (IPWM) method to recognize the prokaryotic promoters based on an entropy measurement. Using the Hepatitis C Virus (HCV) nucleotide sequences, authors in [55] designed a global PWM for the genotype of HCV genomes. Then using the PWM, signatures were selected from the 5' NCR, CORE, E1, and NS5B regions of the HCV genome. The predictive power of the selected signatures is then evaluated for predicting the most common HCV genotypes and subtypes. Aside from DNA analysis, the PWM can also be applied to amino acid sequences. Authors in [56] develop an approach involving position specific scoring matrix (PSSM), another name of PWM, to predict protein-protein interactions between protein sequences. They first transform each protein into a PSSM, and then adopt the PSSMs to detect distantly related proteins as well as the protein quaternary structural attributes and protein folding patterns. The research of in [57] proposes a PWM-based algorithm to predict signal peptide sequences and their cleavage positions in the amino acid sequences of bacteria and eukaryotes. Authors in [58] develop a PWM-based method for protein function prediction and propose an argument for why the PWM and associated features have great potential for protein sequence analysis. Although the above methods are successful in respective domains, they do not propose a general method to design a feature embedding for the underlying sequence, which contains rich information about the sequence and can be used as input to various machine learning algorithm. Designing efficient feature vector based representations haave been studied in many domains such as graph analytics [59, 60] , smart grid [61, 62] , electromyography (EMG) [63] , clinical data analysis [64] , network security [65] , and text classification [66] . After the spread of COVID-19, efforts have been made to study the behavior of the virus using machine learning approaches. Several methods have been proposed recently for the classification of spike sequences. Authors in [31, 67] uses k-mers along with kernel based approach to classify the spike sequences. Authors in [30] propose the use of one-hot encoding to classify the viral hosts of coronavirus using spike sequences only. Although they were able to achieve higher predictive performance, authors in [31] shows that k-mers based approach performs better than the one-hot since it preserves sequence order information more effectively. An efficient clustering of the spike sequences is done in [33] . In this section, we propose an approach, PWM2Vec, to generate a fixed-length numerical feature embedding from coronavirus spike sequences for the purposes of host-specification. We also discuss the baseline approaches, specifically one-hot embedding (OHE) [30, 32] and k-mers based feature embedding [31, 32] . We perform feature selection using ridge regression [68] on the resulting embedding before applying machine learning (ML) algorithms. This helps to reduce the dimensionality of the embedding, hence the training time of the downstream ML algorithms. Machine learning algorithms requires the input to be in a numerical format, it is necessary to processes the (spike protein) sequence data into some numerical representation to apply these algorithms. One-hot embedding (OHE) [30] is a typical approach for obtaining a fixed-length numerical representation from sequence data. Considering an alphabet Σ, which contains the characters (amino acids) of the spike protein sequence, we need to map each character of Σ to a numerical (binary 0-1 vector) representation. We have 24 unique amino acids in the protein sequence data, namely, "ABCDEFGHIJKLMNPQRSTVWXYZ". For each character in the protein sequence, we design a feature vector wherein we encode each symbol separately. Each symbol has length 24 and has a value of 1 at the location corresponding to the position of the character in the alphabet, and 0 at all other places in the vector. For example, amino acid Cysteine (C) is encoded as 001 . . . 0. We then concatenate the numerical representations all characters of the protein sequence into a single binary feature vector of this spike sequence. In our coronavirus spike protein sequence dataset, after multiple alignment, the length of each spike sequence is 3498. Therefore, the length of each binary vector computed using OHE would be 3498 × 24 = 83, 952. One of the major drawbacks of OHE is the high dimensionality of the resulting set of feature vectors. Another problem with OHE is that some (sequential) ordering information on the characters (amino acids) of the sequence is not preserved. An approach which addresses both of these problems is to use sub-strings (also called mers) of length k, i.e., k-mers. From a sequence, k-mers are generated by applying a sliding window of size k over the sequences (see Figure 2 ). Given a sequence of length N, the total number of k-mers that could be generated is as follows: (1) In our experiments to generate k-mers based frequency vectors, we use k = 3 (as done in [31, 32] ). After generating the k-mers, we create a feature vector Φ (a frequency vector), which contains the frequency (count) of each k-mer occurring in the sequence [31, 32] . Given some sequence σ on alphabets Σ, the length of feature vector Φ k (σ) will be |Σ| k . Since we are working with spike protein (amino acid) sequences, and taking k = 3 in our experiments, the length of the feature vector we use is 24 3 = 13824. This feature vector can be used as input for various ML algorithms. Note that generating sub-strings with a sliding window preserves some (sequential) ordering information on the characters (amino acids) of the (spike protein) sequences, which counters one of the drawbacks of the OHE approach. But still to get the frequency count for k-mers we got a problem of high computational cost for bin matching, especially for worst case searches. Also the dimensionality of the frequency vectors is still high. Despite the fact that the problem of high dimensionality of the vectors generated in the OHE approach is somewhat mitigated in the k-mers approach, the dimensionality of the frequency vectors generated in the k-mers approach remains quite high -further improvements can certainly be made. Also, if we can reduce the computational cost for bin matching that would be huge improvement in computational cost. To address these problems, we propose PWM2Vec, an approach for generating a fixed-length numerical feature vector based on the concept of the position-weight matrix [69] . While our approach is inspired by the value of the PWM for finding motifs in (e.g., protein) sequences, we use it in a slightly different way here. Here, we build a PWM from the k-mers of our sequence, and our feature vector is the score of each k-mer from this PWM. This allows us to take advantage of k-mers in their ability to capture locality information, while also capturing the importance of the position of each amino acid in the sequence (information that is lost in computing k-mer frequency vector). Combining these pieces of information in this way allows us to devise a compact and general feature embedding that can be used with many downstream ML tasks. Our approach PWM2Vec for feature vector generation, is summarized in Figure 3 , we follow the steps (a-h) as explained below. Figure 3 (a) Given an input spike protein sequence , Figure 3 (b) we first extract the k-mers (we use k = 9 in the experiments, which is decided using a standard validation set approach [70] ). Figure 3 (c) We then generate a position frequency matrix , which contains the frequency count for each character at each position. Note that, in the example, since the (amino acid) sequence is composed of four characters, there are four possible characters at any position. At position 1, for example, in all 5 k-mers, there are 2 B characters, and so the frequency count of B at position 1 is 2. In our experiments, since we have 24 unique amino acids in our spike protein sequence dataset, our PFMs will have 24 rows and k = 9 columns. Figure 3 (d) Next, we normalize the PFM matrix, and create a position probability matrix (PPM) containing the probability of each (amino acid) character at each position. For example, the probability of B in the k-mers at position 1 is the following: frequency count total count = 2/5 = 0.4 It is possible that the frequency (hence probability in the PPM) of a character at a certain position is 0. In order to avoid 0 values at any position in the matrix while calculating the probability, we add a Laplace estimator, or a pseudocount to each value in the position probability matrix as shown in Figure 3 (e). We use a pseudocount of 0.1 in our experiments [71] . We then compute a position weight matrix (PWM) from the adjusted probability matrix. We make the PWM by computing the log likelihood of each amino acid character c, i.e., c ∈ A, B, C, . . . , Z, appearing at each position i according to Note that this likelihood is taken under the assumption that the expected frequency of each amino acid is the same (i.e., p(c) = 1/|Σ|). since we have 24 bases amino acids(p(c) = 1/24 = 0.041). Figure 3 (f) shows the computed position weight matrix (PWM). After getting the PWM, we use it to compute the absolute scores for each individual k-mer generated from the sequence (see Figure 3 (g) for an example). It is sum of score of bases for the index. Calculating the absolute score is shown in Figure 4 for k-mer (BFDBEDDFF). The highlighted values in matrices are summed up to give Absolute score for the k-mer which sums up to 28.28. Finally, the scores of all the 9-mers are concatenated to get the final feature vector for the given sequence (see Figure 3 (h) for an example). This whole process is repeated for each sequence. Given a k-mer and PWM, the score for that k-mer can be computed as given in Figure 4 . The final length of PWM2Vec based feature vector is 3490, which is equal to the number of k-mers in each spike sequence. We use Ridge Regression (RR) as the feature selection approach, which is a commonly used for estimating parameters thereby addressing collinearity in multiple linear regression model problems [72, 73] . This method uses a bias to boost the performance of the model by improving the variance and making the slope more horizontal like. This is useful when we need to find out which of the independent attributes are not needed. This gives us the option of removing such columns (attributes) and bring the slope to zero (see Figure 5 ). The expression for performing ridge regression is the following: where α × slope 2 is a penalty term. The total number of selected attributes after performing RR in OHE is 22322, for k-mers approach is 7088 while 1616 features are selected for the PWM2Vec based approach. In this section, we describe the setup we used for the experiments followed by the dataset statistics. We also give visual representation of the data using t-SNE plots. All experiments are conducted using an Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.40GHz having Windows 10 64 bit OS with 32 GB memory. We Implemented our algorithm in Python and the code is available online for reproducibility 1 To measure the performance of the classifiers, we use average accuracy, precision, recall, weighted and macro F1, and receiver operating characteristic curve "ROC" area under the curve "AUC" (one-vs-rest approach). We also measure the training time (in seconds) for each of the classifier. For clustering, we use simple k-means algorithm and use 3 internal clustering quality metrics namely Silhouette Coefficient, Calinski-Harabasz Score, and Davies Bouldin Score to measure the performance of the clusters. We also show the runtime for k-means for different embedding methods. The silhouette coefficient [74] is used for interpretation and validation of consistency within clusters of data. A clustering algorithm having well-defined (comparatively pure) clusters will have a higher silhouette coefficient value. The Silhouette Coefficient (SC) is computed as follows: where x is the average distance between a sample and all other points in the data belonging to the same class and y is the average distance between sample x and all other data points in the next nearest cluster. The Calinski-Harabasz score [75] is used to measure the quality of a clustering algorithm based on the mean between and inside cluster sum of squares. A clustering algorithm with well defined clusters will have a higher Calinski-Harabasz score. The Calinski-Harabasz score is defined as the ratio of the between-clusters dispersion (the sum of distances squared) mean and the within-clusters dispersion. More 1 https://github.com/sarwanpasha/PWM2Vec 2 Available in published version formally, given a dataset D of size n D that has been clustered into j clusters, we use following expression to compute Calinski-Harabasz (CH) score: where tr(B j ) is the trace of the between cluster dispersion matrix, tr(W j ) is the trace of the within-group dispersion matrix. The Davies-Bouldin (DB) score [76] of a clustering C is defined as follows: where D ij is the ratio of the "within-to-between cluster distance" of the ith and jth clusters. For each cluster, we compute the worst case ratio (D ij ) of a within-to-between cluster distance between it and any other cluster, and then take the mean. Therefore, by minimizing the DB score, we can make sure that different clusters are separate from each other (smaller value is better). The Spike protein sequences of CoVs for all hosts used in this analysis were retrieved (on September 21, 2021) from the NIAD Virus Pathogen Database and Analysis Resource (ViPR) [77] and GISAID [24] . A total of 5568 complete protein sequence were collected (3358 from ViPR and 2210 from GISAID), later dropping 10 not attributable to any host detail. The distribution of the dataset across the different host types (grouped by family) is shown in Table 1 , which contains information about the 21 host types that we collected from the annotation of the sequences. We also divided the viral sequences themselves into genus and subgenus to see which category a specific coronavirus belongs to. Figure 6 and Figure 7 contains distributions of viral genus and subgenus, respectively. The multiple sequence alignment (MSA) for the sequence dataset was conducted using Mafft Alignment software with default parameter settings which automatically select the appropriate strategy according to the sequence data size. In our case, the gap opening penalty op was 1.53 and gap extension penalty ep was 0.123 [78] . Given that our dataset was already sufficiently large, and contained a number of unknown or identified amino acids, we were constrained to using the minimum accuracy parameter of Mafft MSA in order to allow the alignment to complete in a reasonable amount of time. Attempts to set more stringent parameters (op and ep) in order to improve this alignment resulted in runtimes > 24 hours. Since this is already the case when performing multiple alignment of even 5000 sequences, anything substantially larger is out of reach. In order to see if there is any natural (hidden) clustering in the data, we used t-distributed stochastic neighbor embedding (t-SNE) [79] approach, which maps input sequences to 2D real vectors. The t-SNE plots for different embedding methods are shown in Figure 8 , Figure 9 , and Figure 10 contains the t-SNE plots for OHE, k-mers, and PWM2Vec, respectively. We can observe that although with PWM2Vec, more information is included in less dimensional feature vectors, the proposed embedding approach was able to preserve the structure of data similar to OHE and k-mers. In this section, we present our results for PWM2Vec and compare its performance with the baseline One-Hot Embedding (OHE) and the more recent k-mers based embedding approach which has shown to be an improvement over OHE [31, 32] . For classification, we also show the results for feature selection method (ridge regression) for all embedding approaches. We also show the effect of runtime with increasing number of sequences for the best performing classification algorithm. In the end, we show some statistical analysis on the data as well as on the feature vectors computed using different approaches. Table 2 shows the results for different embedding methods with different classification methods without performing any feature selection on the feature vectors. We can see that RF with PWM2Vec is consistently performing better than other embedding methods (in some cases the performance gain for PWM2Vec on third significant digit). Also in terms of training runtime, the NB classifier with PWM2Vec is much better than other approaches. This behavior shows that PWM2Vec is not only better in terms of predictive performance, but is also better in terms of runtime. Table 3 contains the classification results after applying ridge regression classifier of different feature embedding approaches. We can again see that RF with PWM2Vec outperforms other embedding methods for majority of the metrics and NB with PWM2Vec is best in terms of runtime. To evaluate the effect of runtime on the sequences, we use the best performing classifier (Random Forest) and use different embedding methods to perform classification with increasing number of sequences. Figure 11 shows the effect of runtime for (a) OHE vs PWM2Vec and (b) k-mers vs PWM2Vec. We can see that in both cases, PWM2Vec is better in terms of runtime as we increase the number of sequences (on x-axis). For clustering, we use the same feature embeddings that we use for classification task. To get the optimal number of clusters, we used the Elbow method [33] . This method for different number of clusters (ranging from 2 to 30) is performing clustering to see the trade-off between the runtime and the sum of squared error (distortion score). The optimal number of clusters selected are 9 (see Figure 12 ). Table 4 . We can see that PWM2Vec is better in terms of Silhouette Coefficient and runtime while k-mers based approach is better in terms of Calinski-Harabasz Score and Davies-Bouldin Score. To measure the importance of amino acids corresponding to the class label, we use the information gain (IG). The IG is defined as follows: where H is the entropy, and p i is the probability of the class i. Figure 13 shows the IG values for different amino acids corresponding to the class labels (hosts). We can see that some amino acids have higher IG values, which means that they play an important role towards the prediction of hosts. Here we can conclude that many amino acids contribute to the host specification as compared to that for SARS-CoV-2 variants specification [31] , which is expected, since the genomic variability within the family coronavidiridae should be much higher. The IG values for all amino acids are also available online for further analysis 3 . Since information gain does not give us the negative (or opposite) contribution of an attribute (feature) corresponding to the class label (host names), we use other statistical measures such as Pearson correlation [80] and Spearman correlation [81] to further evaluate the contribution of features in PWM2Vec based feature vector. The Pearson Correlation is computed using the following expression: where r is the correlation coefficient, x i is the values of the x-variable in a sample,x is the mean of the values of the x-variable, y i is the values of the y-variable in a sample, andȳ is the mean of the values of the y-variable. The Spearman Correlation is computed using the following expression: where ρ is the Spearman's rank correlation coefficient, d i is the difference between the two ranks of each observation, and nis the total number of observations. The Pearson correlation and corresponding P-values for PWM2Vec are given in Figure 14 while the Spearman correlation and corresponding P-values for PWM2Vec are given in Figure 15 . We can observe that most of the features are contributing towards the prediction of different hosts. Table 5 contains the comparison of correlation values computed using Pearson correlation and Spearman correlation for different embedding methods. Here, we can observe that with less dimensional and more compact approach for feature embedding (PWM2Vec), the fraction of features having correlation values greater than the threshold (i.e. 0.3 and −0.3) are greater than those given for OHE and comparable with those given for k-mers (sometimes better also). This behavior shows that using PWM2Vec, we are able to preserve more information in a smaller feature vector and improve the runtime of underlying ML algorithms while giving better (sometimes comparable) predictive performance. 0 1,000 2,000 3,000 −0.5 We propose an approach, called PWM2Vec to generate feature vector representation for different hosts of different coronaviruses using spike sequences only, which can be used as input for different machine learning algorithms such as classification and clustering. We show that our approach is not only efficient to generate feature vectors as compared to the popular method based on k-mers, but has comparable prediction accuracies and much better training runtime. This behavior is also observed after applying feature selection algorithm. We also provided some statistical analysis on the data and feature vectors to show the importance of attributes towards the prediction of class labels (hosts). This statistical analysis provides a validation, from an independent point of view (in terms of fraction of features statistically correlated to the label), of the compactness of our PWM2Vec embedding, compared to the baselines. In the future, we would focus on collecting more data to evaluate the scalability of the PWM2Vec. Using unsupervised methods for dimensionality reduction is another future extension of this work. We would also like to use deep learning models such as LSTM and GRU for the classification purposes in the future. The application of this to larger families of viruses could also be another interesting future direction. Recent Developments on Therapeutic and Diagnostic Approaches for COVID-19 A pneumonia outbreak associated with a new coronavirus of probable bat origin The Late Miocene Radiation of Modern Felidae: A Genetic Assessment Dromedary camels and Middle East respiratory syndrome: MERS coronavirus in the ship of the desert Virus Taxonomy COVID-19 diagnosis-A review of current methods Animal origins of the severe acute respiratory syndrome coronavirus: insight from ACE2-S-protein interactions Structure, function, and evolution of coronavirus spike proteins. Annual review of virology The increasing importance of the novel Coronavirus The molecular biology of SARS coronavirus Asymptomatic and human-to-human transmission of SARS-CoV-2 in a 2-family cluster Clinical characteristics of coronavirus disease 2019 in China Preventing the next pandemic -Zoonotic diseases and how to break the chain of transmission report United Nations 2020 COVID-19-zoonosis or emerging infectious disease? Frontiers in Public Health COVID-19 transmission, current treatment, and future therapeutic strategies Pangolins harbor SARS-CoV-2-related coronaviruses Epidemiological Study of Betacoronaviruses in Captive Malayan Pangolins What does plant-based vaccine technology offer to the fight against COVID-19? Vaccines Retargeting of coronavirus by substitution of the spike glycoprotein ectodomain: crossing the host cell species barrier Recombinant avian infectious bronchitis virus expressing a heterologous spike gene demonstrates that the spike protein is a determinant of cell tropism Nextstrain: real-time tracking of pathogen evolution IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era Does a tree-like phylogeny only exist at the tips in the prokaryotes? Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees a tool for approximated maximum likelihood supertree reconstruction Phylogenetic supertree reveals detailed evolution of SARS-CoV-2 From Alpha to Zeta: Identifying variants and subtypes of SARS-CoV-2 via clustering Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone A k-mer based approach for sars-cov-2 variant identification Spike2vec: An efficient and scalable embedding approach for covid-19 spike sequences Effective and scalable clustering of SARS-CoV-2 sequences Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants Discovering Discriminative K-mers from Sequencing Data for Metagenomic Reads Classification An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes Towards end-to-end disease prediction from raw metagenomic data Kraken: ultrafast metagenomic sequence classification using exact alignments Classification of Metagenomes Using k-mers PWMScan: a fast tool for scanning entire genomes with a position-specific weight matrix Psi-blast tutorial Fitting a mixture model by expectation maximization to discover motifs in bipolymers Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment Position weight matrix, gibbs sampler, and the associated significance tests in motif characterization and prediction Review of different sequence motif finding algorithms YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation Logo2PWM: a tool to convert sequence logo to position weight matrix A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites DNA binding sites: representation and discovery Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells Occupancy classification of position weight matrix-inferred transcription factor binding sites Tree-based position weight matrix approach to model transcription factor binding site profiles Optimizing the GATA-3 position weight matrix to improve the identification of novel binding sites An improved position weight matrix method based on an entropy measure for the recognition of prokaryotic promoters HCV genotyping using statistical classification approach Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network PrediSi: prediction of signal peptides and their cleavage positions On position-specific scoring matrix for protein function prediction Predicting attributes of nodes using network structure Combinatorial trace method for network immunization Short term load forecasting using smart meter data Short-term load forecasting using AMI data Effect of Analysis Window and Feature Selection on Classification of Hand Movements Using EMG Signal Efficient Analysis of COVID-19 Clinical Data using Machine Learning Models Detecting ddos attack on sdn due to vulnerabilities in openflow Language independent sentiment analysis Classifying COVID-19 Spike Sequences from Geographic Location Using Deep Learning Ridge regression: some simulations Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli Pattern Recognition: A Statistical Approach Pseudocounts for transcription factor binding sites Ridge regression Simpler and faster development of tumor phylogeny pipelines Silhouettes: a graphical aid to the interpretation and validation of cluster analysis A dendrite method for cluster analysis A cluster separation measure ViPR: an open bioinformatics database and analysis resource for virology research MAFFT-DASH: integrated protein sequence and structural alignment Visualizing data using t-SNE Pearson correlation coefficient. In Noise reduction in speech processing Spearman correlation coefficients, differences between. Encyclopedia of statistical sciences © 2022 by the authors. Submitted to Journal Not Specified for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license