key: cord-0923586-px1ioxqw
authors: Singh, Om Prakash; Vallejo, Marta; El-Badawy, Ismail M.; Aysha, Ali; Madhanagopal, Jagannathan; Mohd Faudzi, Ahmad Athif
title: Classification of SARS-CoV-2 and Non-SARS-CoV-2 Using Machine Learning Algorithms
date: 2021-07-21
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2021.104650
sha: cfdf0ed60561a63d53012865692d2dd0d0779c37
doc_id: 923586
cord_uid: px1ioxqw

Due to the continued evolution of the SARS-CoV-2 pandemic, researchers worldwide are working to mitigate and suppress its spread and to better understand it by deploying digital signal processing (DSP) and machine learning approaches. This study presents an alignment-free approach to classify SARS-CoV-2 using complementary DNA, which is DNA synthesized from the single-stranded RNA virus. Herein, a total of 1582 samples, with different lengths of genome sequences from different regions, were collected from various data sources and divided into a SARS-CoV-2 and a non-SARS-CoV-2 group. We extracted eight biomarkers based on three-base periodicity, using DSP techniques, and ranked them using filter-based feature selection. The ranked biomarkers were fed into k-nearest neighbor, support vector machine, decision tree, and random forest classifiers for the classification of SARS-CoV-2 from other coronaviruses. The training dataset was used to test the performance of the classifiers based on accuracy and F-measure via 10-fold cross-validation. Kappa-scores were estimated to check the influence of unbalanced data. Further, a 10x10 cross-validation paired t-test was utilized to select the best model to test with unseen data. Random forest was selected as the best model, differentiating the SARS-CoV-2 coronavirus from other coronaviruses and a control group with an accuracy of 97.4%, sensitivity of 96.2%, and specificity of 98.2% when tested with unseen samples. Moreover, the proposed algorithm was computationally efficient, taking only 0.31 seconds to compute the genome biomarkers, outperforming previous studies.

proteins. In addition, nine major sub-genomic RNAs are produced (see Figure 1), which are translated into accessory proteins of SARS-CoV-2.

Figure 1. SARS-CoV-2 genome organization [1] and ordering/location of the various encoded proteins.

Symptoms of COVID19 identified to date include fever, cough, myalgia, headache, shortness of breath, chills, sore throat, runny nose, chest pain, rash, nausea, vomiting, diarrhea, and fatigue.

Since many of the symptoms resemble those of the common cold and influenza, an accurate molecular result is critical for a final diagnosis. The real-time polymerase chain reaction is a well-known molecular method [8] but has suffered from a high false-negative rate and a detection rate of 30-50% [6, 9], due to the variation of viral RNA sequences within viral species (see Figure 1) and the viral load in various anatomic sites [10]. In addition, COVID19 assays can result in low sensitivity if not aligned properly with the virus template, as the virus is strongly related to other coronavirus species. Moreover, SARS-CoV-2 may present with other lung infections, which makes it even more challenging to identify [11]. Thus, researchers worldwide have applied various digital signal processing (DSP) methods, such as discrete Fourier transform (DFT) [12], digital filter [13], time-domain periodogram (TDP) [14], modified average magnitude difference function (AMDF) [15], singular value decomposition (SVD) [16], and modified SVD [17], which includes forward-backward filtering, to detect three-base periodicity (or the period-3 property) for the prediction of exon locations in DNA sequences [12].

These methods could potentially be useful in suppressing the spread of SARS-CoV-2 by discriminating it from other coronaviruses.

Three-base periodicity is an intrinsic property of protein-coding regions (known as exons) of DNA [12]. It can be used to distinguish protein-coding sequences from non-coding sequences (known as introns), which do not show the same periodicity. Figure 2 shows that the Fourier spectrum of DNA sequences (SARS-CoV-2 isolate Wuhan-Hu-1) exhibits a strong spectral component at frequency 1/3.

Numerous studies report the distinction of viruses using various classification approaches such as support vector machine (SVM) [18], decision trees (DT) [19], Gaussian radial basis function neural network [20], random forest (RF) [21], gapped Markov Chain with SVM [22], k-nearest neighbor (k-NN) [23], and convolutional neural network [24]. Besides, k-mers (oligomers of length k) based SVM, ML-DSP, and MLDSP have been utilized effectively in virology, including HIV-1 genomes, influenza, dengue, and COVID19 classification [25] [26] [27]. However, k-mers do not work well with short sequences, and the use of a higher k exponentially increases the number of features, which poses a significant computational challenge. Further, there can be less tolerance when proteins contain errors or mutations, as k-mers must be contiguous. Lopez-Rincon et al. [28] proposed a deep learning approach for the classification of SARS-CoV-2, with a specificity of 99.39% and a sensitivity of 100%. However, deep learning methods have the disadvantages of lacking interpretability, being computationally expensive, and being prone to overfitting. We employed the electron-ion interaction potential (EIIP) [20] scheme for the numerical representation of complementary DNA (cDNA, hereafter referred to as DNA) as a simple way of enumerating the four different cDNA bases, the three-base periodicity property to extract the genome biomarkers, and ML models to classify SARS-CoV-2.

This work illustrates how DSP, biomarker selection, and ML give a computationally rapid, alignment-free classification of the novel coronavirus. Herein, the DNA sequence, converted into a genomic signal, was used to compute the magnitude spectrum and its average by applying the DFT, in addition to the peak-to-average ratio of the magnitude spectrum, as biomarkers.

Further, AMDF, SVD, and TDP were utilized as biomarkers with zero-phase filtering, in contrast to traditional filters [13] [14] [15] [16], and without filtering. The filter-based Pearson correlation coefficient (PCC) via ANOVA tests and correlation-based feature selection (CFS) were employed to identify the most significant biomarkers for the classification of SARS-CoV-2. Filter techniques scale easily to large datasets, are computationally simple and fast, and are independent of the classifier. Following this, the ranked biomarkers were fed into the k-NN, DT, RF, and SVM classifiers. The proposed method is efficient, computationally inexpensive, and able to correctly distinguish SARS-CoV-2 from non-SARS-CoV-2, which includes the remaining coronaviruses and a control group, without a priori biological knowledge.

The proposed method for the distinction of SARS-CoV-2 comprises five steps: (1) Data collection: a total of 1582 samples, including 615 SARS-CoV-2 and 967 non-SARS-CoV-2 samples; (2) Conversion of the DNA "characters" into numeric values for rapid and more efficient processing; (3) Three-base periodicity property detection for the extraction of genome biomarkers using DSP; (4) Biomarker selection; and (5) ML implementation. Eight biomarkers were derived using the three-base periodicity property and selected based on PCC [29] and CFS [30], as some of the biomarkers may contain irrelevant and/or redundant information that may reduce the performance of the classifier if not removed [21]. The selected biomarkers were fed into the classifiers.
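Step (2), the symbolic-to-numeric conversion, can be illustrated with a minimal sketch. It assumes the EIIP values commonly quoted in the genomic signal processing literature (A = 0.1260, C = 0.1340, G = 0.0806, T = 0.1335); the function name `eiip_signal` and the handling of characters other than A/C/G/T are illustrative assumptions, not details taken from this study.

```python
# EIIP values as commonly quoted in the literature (assumed here; the
# exact table used by the authors is not reproduced in this text).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_signal(sequence, unknown=0.0):
    """Map a DNA/cDNA string to a list of EIIP values.

    Characters outside A/C/G/T (for example 'N') are mapped to `unknown`,
    which is an illustrative choice, not something the paper specifies.
    """
    return [EIIP.get(base, unknown) for base in sequence.upper()]


# Hypothetical six-base fragment, used only to show the call.
signal = eiip_signal("ATGCGT")
```

The resulting list plays the role of the genomic signal x(n) used in the later DSP steps.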

The COVID19 Wuhan-Hu-1 whole reference genome of 29903 bps was downloaded from the National [27] . COVID19 data comprises the complete genome, the complete coding sequence (CDS), and partial CDS, whose length varies from 64 to 29945 bps.

Besides, other human, mammal, and bird coronaviruses [31] were incorporated into the non-SARS-CoV-2 group to assess the robustness and effectiveness of the proposed algorithms. We also downloaded a control sample from the Epitranscriptomics and RNA Dynamics Lab (Novoa Lab) and the Bioinformatics Core Facility (BioCore) at the Centre for Genome Regulation [32]. The methodology for the extraction of genome biomarkers using the three-base periodicity property is elucidated in the following sections. Table 1 lists the name of the coronavirus species, number of samples, and designated labels.

Herein, a robust DSP algorithm is developed to investigate the strength of the three-base periodicity to extract the significant genome biomarkers. We selected eight biomarkers using various DSP-based methods, namely average magnitude spectrum, peak-to-average ratio of the magnitude spectrum, SVD, SVD with filtering, AMDF, AMDF with filtering, TDP, and TDP with filtering.

Table 1 (fragment). Coronavirus species, number of samples, label:
[31] 4 0
PREDICT_CoV-47 [31] 2 0
PREDICT_CoV-82 [31] 3 0
PREDICT_CoV-92 [31] 36 0
PREDICT_CoV-93 [31] 3 0
PREDICT_CoV-96 [31] 5 0
bat-SL-CoVZC45

Figure 3 exhibits the procedure for the computation of three-base periodicity detection using the discrete Fourier transform. Herein, frequency-domain representation (spectrum) refers to breaking down a signal into its constituent sinusoids. That is, the spectrum of a signal is a representation of its frequency content [33]. Consider a zero-mean genomic signal 'y(n)', of length 'N', as in Eq. (1),

$$y(n) = x(n) - \mu_x \qquad (1)$$

where µx is the mean of the signal 'x(n)', calculated as in Eq. (2),

$$\mu_x = \frac{1}{N}\sum_{n=0}^{N-1} x(n) \qquad (2)$$

The purpose of subtracting the mean is to suppress the zero-frequency component [34], since the direct current component is not significant in the context of detecting the three-base periodicity. The magnitude spectrum of 'y(n)' is computed using the DFT as depicted in Eqs. (3)-(5).

where 'k' is the frequency index, which is related to the frequency 'f' and the sampling frequency 'fs' of the signal.

When dealing with DNA sequences, 'fs' is taken as one sample per second [35] and thereby,

The magnitude spectrum is then normalized as in Eq. (6).

This normalization examines the strength of the 1/3 frequency component relative to the whole magnitude spectrum. Thus, the average of the normalized magnitude spectrum is estimated using Eq. (7): to N/3 as performed in [35] . Then, computing the ratio between the magnitude of this 1/3 spectral component and its average is performed using Eq. (8):
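The steps described above (mean removal, DFT magnitude, and the ratio of the 1/3 spectral component to the average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DFT is evaluated directly with the standard library, the average is taken over the positive-frequency bins with the zero-frequency bin excluded, and the sequence is trimmed to a multiple of three so that k = N/3 is an integer bin. Any constant scaling of the spectrum cancels in the ratio, so no normalization step is shown. The function names are hypothetical.

```python
import cmath
import math


def dft_magnitude(y):
    """Magnitude of the DFT of y for k = 0..N-1 (direct O(N^2) evaluation)."""
    n_samples = len(y)
    mags = []
    for k in range(n_samples):
        acc = sum(y[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        mags.append(abs(acc))
    return mags


def peak_to_average_ratio(x):
    """Magnitude at f = 1/3 divided by the mean magnitude of the spectrum."""
    n_samples = len(x) - len(x) % 3       # trim so that k = N/3 is an integer
    if n_samples < 3:
        raise ValueError("need at least three samples")
    x = x[:n_samples]
    mean_x = sum(x) / n_samples
    y = [v - mean_x for v in x]           # zero-mean signal y(n) = x(n) - mu_x
    mags = dft_magnitude(y)
    half = mags[1:n_samples // 2 + 1]     # positive frequencies, DC excluded
    average = sum(half) / len(half)       # assumed averaging range
    return mags[n_samples // 3] / average
```

For genome-length inputs, a fast Fourier transform routine would replace the direct evaluation; the direct form is used here only to keep the sketch dependency-free.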

This feature is expected to be relatively higher if the DNA is a protein-coding sequence (e.g. viral genome), as the nucleotides exhibit three-base periodicity.

Figure 4 illustrates the steps to estimate the AMDF, SVD, and TDP. In these methods, the genomic signal x(n) is filtered to emphasize the three-base periodicity and point out the protein-coding region in the DNA sequence, conventionally by means of traditional filtering methods [12, 33] (see Figure 5). However, we employed zero-phase filtering, instead of traditional filtering, to overcome the non-linear phase distortion. In addition, we investigated the impact of the AMDF, SVD, and TDP approaches without filtering, which may enhance the computational efficiency of the proposed algorithms. See the supplementary material for the illustration of the anti-notch filter and the mathematical description of AMDF, SVD, and TDP.
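The idea of zero-phase (forward-backward) filtering can be sketched as below. The anti-notch filter actually used is described in the supplementary material and is not reproduced here; as a stand-in, this sketch assumes a simple two-pole resonator centred at the angular frequency 2π/3 with pole radius r = 0.9. Both the filter form and the value of r are assumptions chosen for illustration.

```python
import math


def resonator(x, w0=2 * math.pi / 3, r=0.9):
    """Causal two-pole resonator centred at w0 (stand-in anti-notch filter)."""
    a1 = -2 * r * math.cos(w0)
    a2 = r * r
    y = []
    for n, v in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        # Difference equation of H(z) = 1 / (1 + a1 z^-1 + a2 z^-2)
        y.append(v - a1 * y1 - a2 * y2)
    return y


def zero_phase_filter(x, **kwargs):
    """Filter forward, then backward, so the two phase responses cancel."""
    forward = resonator(x, **kwargs)
    backward = resonator(forward[::-1], **kwargs)
    return backward[::-1]
```

Away from the two ends of the signal, a period-3 component passes through `zero_phase_filter` amplified but not shifted in time, which is the property that motivates the forward-backward scheme.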

Our main aim is to employ the minimum number of best biomarkers to maximize the performance of each classifier for the problem under consideration. Herein, filter-based biomarker selection is employed instead of wrapper approaches, since filter methods compute the relevance of biomarkers from their correlation with the dependent class. Wrapper techniques, on the other hand, measure the effectiveness of a subset of biomarkers by training a classifier via cross-validation (CV) [36], limiting the use of more than one classifier at a time. Moreover, filter-based techniques reduce the risk of overfitting and the computation cost, and the selection is independent of any classifier [37] [38].

Herein, two filter techniques were deployed: PCC and CFS [39]. The explanations of the PCC and CFS are included in the Supplementary. Further, the probability density functions (PDFs) are computed using the kernel density estimation method for the best biomarker to provide a qualitative assessment between the SARS-CoV-2 and non-SARS-CoV-2 groups [40].
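A filter-based ranking by Pearson correlation can be sketched as follows. This is a minimal illustration, assuming the class is encoded as 0/1 and that biomarkers with |r| below a threshold are discarded; the threshold of 0.1 mirrors the "no relation (r<0.1)" cut-off mentioned in this paper, while the function names and the toy data layout are hypothetical.

```python
import math


def pearson_r(feature, labels):
    """Pearson correlation between one biomarker and the 0/1 class label."""
    n = len(feature)
    mean_f = sum(feature) / n
    mean_l = sum(labels) / n
    cov = sum((f - mean_f) * (l - mean_l) for f, l in zip(feature, labels))
    var_f = sum((f - mean_f) ** 2 for f in feature)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    return cov / math.sqrt(var_f * var_l)


def rank_biomarkers(columns, labels, threshold=0.1):
    """Return (name, |r|) pairs with |r| >= threshold, strongest first."""
    scored = [(name, abs(pearson_r(values, labels)))
              for name, values in columns.items()]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because the score depends only on the data and not on any classifier, the same ranked list can be reused for k-NN, DT, RF, and SVM.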

Herein, k-NN, DT, RF, and SVM were used for the classification of SARS-CoV-2 and non-SARS-CoV-2. ML techniques have proven to be powerful tools for addressing such tasks [25] [26] [27] [28]. ML refers to a series of algorithms that derive their functionality by learning from unlabeled or labeled data, rather than using predefined sets of functions and rules [41]. This property is ideal for predicting histopathological characteristics, clinical outcomes, molecular biomarkers, or treatment responses [41]. To augment generalizability and limit overfitting, ML includes training, validation, and external testing in separate datasets [41]. CV uses the training and validation datasets to fit the classifier, evaluate its performance, and optimize its hyperparameters [41] based on arbitrary sub-separation and iterative cycles of training and validation [42]. The testing dataset is kept completely independent from development and is utilized to assess the final model and verify its performance and generalizability. Short descriptions of k-NN, DT, RF, and SVM are included in the following paragraphs.
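The sub-separation used in k-fold CV can be sketched as below. This is an illustrative stratified assignment, assuming a round-robin allocation within each class after shuffling; it is not the fold-generation procedure of any particular toolkit, and the function name and seed are hypothetical.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to a fold, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds
```

In each of the ten iterations, one fold serves as the validation set and the remaining nine are used for training.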

k-NN has been widely used due to its simple implementation and high efficiency [26]. It is a very versatile algorithm since it can be applied to classification, regression, and missing value imputation problems. The key idea of the standard k-NN is to search for the K nearest neighbors of a given test sample. The two main elements that affect its performance are the selection of a proper K value and the choice of the best distance function for identifying the K neighbors.
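The standard k-NN rule can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and a simple majority vote with K = 3; the function name and the toy feature vectors in the usage note are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For example, with three training points near (0, 0) labeled 0 and three near (1, 1) labeled 1, a query at (0.05, 0.05) is assigned label 0 because all three of its nearest neighbors carry that label.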

DT is one of the most well-known machine learning methods for data classification. It is a tree-based technique in which the model is represented as a set of nodes and hierarchical connections that represent relationships. The connections form a path that starts from a root node, and it is described by a sequence where data is recursively separated until reaching a Boolean outcome in a leaf node. DTs are considered a powerful method in terms of accuracy, simple analysis, predictive power, and fast convergence [27].

RF [25] is based on the idea that aggregating multiple decision trees causes a decrease of variance.

In this study, the total number of samples (1582 samples) was divided into two sets via resampling. The first set consists of 70% (1107 samples) from both classes, SARS-CoV-2 (424 samples) and non-SARS-CoV-2 (683 samples), which were used to train the model via ten-fold CV. The second set contained 30% (475 samples) from both classes, SARS-CoV-2 (189 samples) and non-SARS-CoV-2 (286 samples), and was used as a testing dataset with the trained model. The classifiers were validated via ten-fold CV, and the best model was selected based on the F-measure [43], which balances the recall and precision of the model. In addition, the efficacy of the model was evaluated on the accuracy metric via the corrected 10x10 fold CV paired t-test [44]. This method compares the means of two groups of compatible data, determining which one is lower or whether they are equivalent, prior to applying the trained model to the test dataset. Moreover, the Kappa-score was utilized to verify the influence of imbalanced data between the two classes, as one class (non-SARS-CoV-2) comprises 61.15% compared with 38.85% for the SARS-CoV-2 class.
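The two summary scores named above can be computed from the four cells of a binary confusion matrix. The following is a minimal sketch using the standard definitions of the F-measure (harmonic mean of precision and recall) and Cohen's kappa; the function names and the example counts in the usage note are hypothetical and are not taken from this study's tables.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between predicted and true labels, corrected for chance."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement from the row and column totals of the matrix.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    return (observed - expected) / (1 - expected)
```

As a worked example with made-up counts tp = 40, fp = 5, fn = 10, tn = 45: the observed agreement is 0.85, the chance agreement is 0.5, so kappa is 0.7, while the F-measure is 80/95, roughly 0.842. A kappa close to 1 indicates that high accuracy is not merely an effect of one class dominating the data.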

In this study, DT and RF were deployed with default parameters, while k-NN used K=3 and SVM used a radial basis function (RBF) kernel implemented with the C++ LIBSVM library [45]. The hyperparameters of the RBF kernel (penalty constant, C, and kernel width, γ) were optimized by a grid search to maximize performance. All four ML experiments in this study were conducted using WEKA [44].

The performance of the trained model on the testing dataset was evaluated using confusion matrices (refer to the Supplementary) in terms of sensitivity, specificity, and accuracy [46]. Herein, sensitivity assesses whether the SARS-CoV-2 data is correctly recognized by the classifier, whereas specificity reveals how well the non-SARS-CoV-2 data is identified. The accuracy assesses the proportion of all samples that were correctly classified.
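These three measures follow directly from the confusion matrix, as the sketch below shows. The counts used in the example are hypothetical: they were chosen only to match the test-set sizes given earlier (189 SARS-CoV-2 and 286 non-SARS-CoV-2 samples) and to be consistent with the rates reported in this paper; they are not copied from the paper's confusion matrix.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)            # SARS-CoV-2 correctly recognized
    specificity = tn / (tn + fp)            # non-SARS-CoV-2 correctly recognized
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts (assumption): 189 positives and 286 negatives in total.
sens, spec, acc = confusion_metrics(tp=182, fn=7, tn=281, fp=5)
```

With these assumed counts, sensitivity is 182/189, specificity is 281/286, and accuracy is 463/475, that is, roughly 96.3%, 98.25%, and 97.47%.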

We employed a DSP-ML-based algorithm for the classification of SARS-CoV-2 and non-SARS-CoV-2. A total of eight biomarkers were extracted based on the three-base periodicity property.

Supplement Figure 1 provides the results of the three-base periodicity property using DFT in terms of frequency and magnitude spectrum for one sample of each class, showing the variation in the magnitude spectrum at the corresponding frequency of 1/3 Hz. The algorithms source code is in

Supplementary Table 1 lists the investigated genome biomarkers with the respective sequence length statistics, mean, and standard deviation (SD) for the SARS-CoV-2 and non-SARS-CoV-2 groups. The biomarkers deviate only slightly from their mean values (Supplementary Table 1), which shows the consistency of the proposed biomarkers and suggests that they will perform similarly even with a greater number of samples. Table 2 depicts the result of the genome biomarker selection. The biomarker GB5 had an "r" value of 0.32, followed by GB3, GB2, GB1, and GB4 with "r" values of 0.18, 0.12, 0.13, and 0.12, respectively, which indicates a weak positive correlation, whereas the remaining biomarkers showed no relation (r<0.1).

Further, an ANOVA test was performed, confirming that GB1 to GB5 are significantly better (p<0.0001) than GB6, GB7, and GB8, which is consistent with the F-values listed in Table 3. Hence, we used a set of five biomarkers for the classification, removing the biomarkers that showed no relation (r<0.1). The CFS method selects subsets of biomarkers that are highly correlated with the class while having low intercorrelation. Table 2 shows that biomarkers GB4 and GB5 possess higher discrimination capabilities, with a merit of 0.67. Thus, both were used as input biomarker vectors.

In addition, the PDF of GB5 and a scatter plot of GB4 and GB5 for 100 samples from both classes are presented in Supplementary Figures 2 and 3.

Herein, the SVM-RBF classifier was optimized via grid search and evaluated using ten-fold CV on the training dataset. The performance of the classifier was assessed based on the accuracy. Table 4 shows that the optimized parameters, penalty constant (C) and width (γ), were 100 and 0.001 for CFS and 1000 and 0.001 for correlation, respectively. Further, the optimized parameters and the same training dataset were used for comparison with the other classifiers while assessing the performance of the SVM. Table 4 presents the results of the classifiers via ten-fold CV in terms of the mean and SD, using the CFS and correlation-based ranked biomarkers. The latter performed better than CFS, achieving slightly higher accuracy. The RF shows a slightly greater F-measure (mean, 98%; SD, 2%) compared with the other classifiers. This means that RF will possibly sustain acceptable precision and recall. In addition, the Kappa-scores (>0.9) indicate that the employed classifiers can cope well with the disproportionate amount of data in the two groups. However, the F-measures of SVM-RBF, DT, and RF are close to that of k-NN (see Table 4). Hence, prior to applying the model to unseen data, a corrected 10x10 fold CV paired t-test was performed to assess the model performance based on accuracy. Table 5 depicts the results of the corrected 10x10 fold CV paired t-test. It can be seen that the RF holds victory (v), whereas k-NN, DT, and SVM-RBF carry neither an asterisk (*) nor a 'v', which means that the t-test was inconclusive for them. Therefore, RF was selected and deployed for further testing on the unseen data.
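The statistic behind a corrected paired t-test over repeated CV can be sketched as follows. This is a minimal illustration assuming the corrected resampled t statistic discussed in the replicability literature cited as [44], in which the variance term is inflated by the test-to-train size ratio to account for overlapping training sets; it is not the code of the toolkit used in this study, and the function name is hypothetical.

```python
import math


def corrected_resampled_t(diffs, test_train_ratio):
    """Corrected resampled t statistic for paired accuracy differences.

    `diffs` holds one accuracy difference per fold and repetition
    (100 values for 10x10 CV); `test_train_ratio` is n_test / n_train
    (1/9 for ten-fold CV).
    """
    k = len(diffs)
    mean = sum(diffs) / k
    variance = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt((1 / k + test_train_ratio) * variance)
```

The statistic is compared against a t distribution with k - 1 degrees of freedom. Because the correction term enlarges the denominator, the corrected test is more conservative than the ordinary paired t-test, which helps to explain why close-scoring classifiers can remain statistically indistinguishable.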

The selected RF model was trained and deployed for testing using the 30% unseen samples. Table 6 illustrates that the selected model can correctly classify the SARS-CoV-2 (sensitivity, 96.29%) and non-SARS-CoV-2 (specificity, 98.25%) samples with an accuracy of 97.47%. Figure 6 indicates that the proposed algorithm is computationally inexpensive and efficient enough to be implemented in a real-time scenario.

Table 5 note: (v/ /*) reads as follows: v, victory; *, poorly statistically significant; blank, unable to say.

Table 6. Classification results using RF on unseen testing dataset. SARS-CoV-2, severe acute respiratory syndrome coronavirus 2; non-SARS-CoV-2, non severe acute respiratory syndrome coronavirus 2; RF, random forest.

[Confusion matrix; both axes labeled Covid-19 and Non-Covid-19.]

Based on the history of SARS-CoV-2, previous studies suggest an origin in bats prior to zoonotic transmission [47]. So far, the early SARS-CoV-2 genomes that have been sequenced and uploaded are more than 99% similar, suggesting that these viruses result from a recent cross-species event [48]. These earlier examinations are based on alignment-based techniques that recognize relationships between SARS-CoV-2 and other coronaviruses through amino acid sequence and nucleotide similarities. When examining the conserved replicase domains of ORF1ab for coronavirus species categorization, almost 94% of amino acid residues were similar to SARS-CoV, whereas the whole-genome similarity reached 70%, which confirms that the SARS-CoV-2 virus was genetically distinct [49]. Within the RNA-dependent RNA polymerase (RdRp) region, it was discovered that the bat coronavirus RaTG13, which formed a lineage distinct from other bat SARS-similar coronaviruses [48], was the nearest relative of SARS-CoV-2. A group of researchers found that two bat SARS-similar coronaviruses, bat-SL-CoVZXC21 and bat-SL-CoVZC45, were also very similar to SARS-CoV-2 [47]. Yet, whether the SARS-CoV-2 virus originated from a recombination event is still unknown [48].

We included distinct types of SARS-CoV-2 data, including complete genome, partial genome, and partial and complete CDS, from different regions such as RdRP, 3'-to-5' exonuclease, and nonstructural protein 3. The sequence lengths included in this study vary from 64 bps to 29945 bps, compared with 2000-50000 bps in earlier studies [27] [28], which shows the robustness of the proposed approach. Further, we proposed a new biomarker based on the three-base periodicity property for the prediction of the SARS-CoV-2 virus.

In this work, eight biomarkers, GB1-GB8, were extracted based on the three-base periodicity property by applying various DSP techniques. Descriptive statistical analysis was performed to examine the distribution of the data, which helps to detect typos and outliers and allows us to identify associations among biomarkers. It can be seen (Supplementary Table 1) that there were minor deviations of the biomarkers from their mean values, which were found to be distinct for the SARS-CoV-2 and non-SARS-CoV-2 groups. Thus, the biomarker selection methods, PCC and CFS, were deployed to enhance the efficiency of ML (Table 2). It can be observed that GB5 (AMDF) possesses the highest discrimination ability for the classification, with a correlation coefficient of 0.32. The outcome agrees with an earlier study [50], wherein a correlation coefficient r ≥ 0.3 is suggested to be significant for medical diagnosis. Therefore, even sequences collected from distinct regions with varied compositions can be compared quantitatively by employing the proposed biomarker (AMDF), with results as uniformly meaningful as when comparing SARS-CoV-2 sequences. Further, an ANOVA test was applied to the biomarkers.

Correlation coefficients <0.3 and p-values <0.05 were assumed statistically significant [51], and the corresponding biomarkers were included as the most significant ones. Table 3 shows that GB1 to GB8 had p-values <0.05. However, GB1 to GB6 reported lower p-values compared with GB7 and GB8. Hence, GB1-GB6 were taken as features for the classification. On the other hand, the CFS-based method revealed GB4 and GB5 as the most notable biomarkers compared with the other biomarkers (Table 3). The three-base periodicity DSP approach is simple and effective, taking an average of 0.4 µseconds/nucleotide compared with the k-mers suggested in [27]. Further, the selected biomarkers from both methods were fed into the classifiers and assessed based on their accuracy. Table 4 illustrates that PCC and CFS were comparable as biomarker selection techniques. However, PCC outperformed CFS, as the accuracy was comparatively higher for all the classifiers.

Further, the Kappa test was performed to confirm the influence of the imbalanced data between the different groups. Table 4 shows that all the classifiers achieved a Kappa-score > 0.9, which means that the results were not affected. The F-measure values were close to each other. Hence, a 10-times 10-fold CV paired t-test was performed using the accuracy to identify the best model to test afterwards on unseen samples. It can be observed from Table 5 that the accuracy and F-measure achieved by k-NN, SVM-RBF, DT, and RF exhibited very close scores. However, the paired t-test revealed that RF had the best replicability. Therefore, RF was chosen to be tested with the unseen data, achieving 96.29% sensitivity and 98.25% specificity with an accuracy of 97.47% (Table 7), which is very close to the findings of previously employed algorithms. Besides, the studies on SARS-CoV-2 based on k-mers and a deep neural network (DNN) conducted by Randhawa et al. [27] and Lopez-Rincon et al. [28] achieved 100% and 98.73% accuracy, respectively, which is slightly higher than the proposed approach, but the computation time of our work is comparatively lower. However, some concerns arise from these studies. Randhawa et al. performed the training without reporting any hyperparameter values for the classifiers, which restricts the reproducibility of their experiments. Also, they used a small number of samples from the SARS-CoV-2 group and did not apply any countermeasures against overfitting. Their study reported 100% accuracy for six classifiers over three different tests. This may be due to overfitting, which means that the findings may not generalize to unseen data. On the other hand, Lopez-Rincon et al. used a significantly imbalanced dataset, where SARS-CoV-2 represents only 11.93% of the samples. Additionally, they utilized a DNN, which requires a huge amount of data, is computationally extremely expensive, and relies on features that are unknown. Further, their approaches can only distinguish SARS-CoV-2 from other coronaviruses, without including a control group. Hence, these works cannot determine whether someone is infected with these types of virus or not. In contrast, the proposed DSP-ML based approach gives comparatively acceptable classification results for the discrimination of SARS-CoV-2 and non-SARS-CoV-2 by employing newly proposed biomarkers, which only require the genome sequence as input. DSP-ML is an alignment-free and ultrafast approach, as can be seen from the time-performance of ML via 10-fold CV for training datasets presented in Table 7. The approach comprises converting the sequences into numeric form; estimating the magnitude spectrum, average magnitude, and peak-to-average ratio using DFT, as well as SVD, SVD with filtering, AMDF, AMDF with filtering, TDP, and TDP with filtering, based on the characteristics of the three-base periodicity property; and classification of the SARS-CoV-2 and non-SARS-CoV-2 groups. The robust validation approach is fast and can cope with short DNA sequences. Hence, it can be deployed efficiently for the prediction of SARS-CoV-2 by using raw cDNA sequences as input.

However, the study is restricted by the limited number of samples, and further investigation with larger data will be required to confirm the efficacy of the proposed approach. The genome sequence data includes partial CDS with short sequence lengths, which may enhance and/or degrade the results. The conventional mapping scheme could be replaced with the Pseudo-EIIP DNA symbolic-to-numeric mapping scheme, which may reduce the computational overhead. We also used the raw data without any pre-processing, which may influence the outcomes.

This study explores the significance of three-base periodicity for the prediction of the SARS-CoV-2 virus. We derived eight biomarkers based on the three-base periodicity property, using DSP techniques, and ranked them using a filter-based biomarker selection method, which reduces the computation time and enhances the efficiency of the classifiers. The ranked biomarkers were fed to distinct classifiers for the prediction of the SARS-CoV-2 coronavirus from other coronaviruses and a control group via 10-fold CV. In addition, a 10x10 CV paired t-test was performed to select the best model to test with the unseen data. The combination of ranked biomarkers (GB1 to GB5) and the best supervised model (RF) is capable of differentiating the SARS-CoV-2 coronavirus with an accuracy of 97.47% and a computation time of 0.31 seconds, which outperforms previous studies.

Our work includes various types of SARS-CoV-2 data, such as complete genome, partial genome, and partial and complete CDS, from different regions and with lengths varying from 64 bps to 29945 bps, which shows the robustness and effectiveness of the proposed approach and also ensures that our results are not affected by the imbalanced dataset. Further, we plan to convert the proposed procedure into a computer-aided system that will allow the timely and efficient differentiation of SARS-CoV-2 viruses from other viruses for the early screening of novel viral outbreaks, which can help to avoid community transmission and decrease the mortality rate. Additionally, we plan to test the feasibility of the proposed features for the classification of different mutants of SARS-CoV-2 by deploying a multi-class classifier.

The architecture of SARS-CoV-2 transcriptome

Identification of Alpha and Beta Coronavirus in Wildlife Species in France: Bats, Rodents, Rabbits, and Hedgehogs

Structural insights into coronavirus entry

SARS and MERS: Recent insights into emerging coronaviruses

Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR

Molecular diagnosis of a novel coronavirus (2019-nCoV) causing an outbreak of pneumonia

CRISPR-based surveillance for COVID-19 using genomically comprehensive machine learning design

Signal processing in sequence analysis: advances in eukaryotic gene prediction

Filter-based methodology for the location of hot spots in proteins and exons in DNA

Gene and exon prediction using time-domain algorithms

Improved time-domain approaches for locating exons in DNA using zero-phase filtering

Advanced protein coding region prediction applying robust SVD algorithm

Improved Singular Value Decomposition-based Exons Prediction Approach Using Forward-backward Filtering

Localization site prediction for membrane proteins by integrating rule and SVM classification

Building predictive models for MERS-CoV infections using data mining techniques

Identification of pathogenic viruses using genomic cepstral coefficients with radial basis function neural network

An efficient comparative machine learning-based metagenomics binning technique via using Random forest

Classifying proteins using gapped Markov feature pairs

Descriptive statistics of the genome: phylogenetic classification of viruses

A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network

An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes

Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning

Correlation-based feature selection for machine learning

Integrating correlation-based feature selection and clustering for improved cardiovascular disease diagnosis

Detection of novel coronaviruses in bats in Myanmar

MasterOfPores: A Workflow for the Analysis of Oxford Nanopore Direct RNA Sequencing Datasets

Digital Signal Processing: Signals, Systems and Filters

Set of rules for genomic signal downsampling

Non-parametric spectral estimation techniques for DNA sequence analysis and exon region prediction

A survey on feature selection methods. Computers & Electrical Engineering

A review of feature selection techniques in bioinformatics

Automatic quantitative analysis of human respired Carbon Dioxide Waveform for Asthma and non-Asthma classification using Support Vector Machine

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

Kernel density estimation via diffusion

Current applications and future impact of machine learning in radiology

Quantitative radiomics studies for tissue characterization: a review of technology and methodological procedures

Predicting future cardiovascular events in patients with peripheral artery disease using electronic health record data. Circulation: Cardiovascular Quality and Outcomes

Evaluating the replicability of significance tests for comparing learning algorithms

LIBSVM: A library for support vector machines

Sensitivity, specificity, and predictive values: foundations, pliabilities, and pitfalls in research and practice. Frontiers in public health

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

A pneumonia outbreak associated with a new coronavirus of probable bat origin

From SARS-CoV to Wuhan 2019-nCoV Outbreak: Similarity of Early Epidemic and Prediction of Future Trends

Measurement in medicine: the analysis of method comparison studies

We are thankful to Prof Mark Bradly for editing the manuscript. The authors would like to express their deepest gratitude to the University of Edinburgh for providing support to accomplish this research.

The authors declare no competing interests.
