key: cord-0962394-st6s7l0b
authors: Gomes, Juliana Carneiro; Masood, Aras Ismael; de S. Silva, Leandro Honorato; Ferreira, Janderson; Júnior, Agostinho A. F.; dos Santos Rocha, Allana Lais; Castro, Letícia; da Silva, Nathália R. C.; Fernandes, Bruno J. T.; dos Santos, Wellington Pinheiro
title: Optimizing the molecular diagnosis of Covid-19 by combining RT-PCR and a pseudo-convolutional machine learning approach to characterize virus DNA sequences
date: 2020-09-28
journal: bioRxiv
DOI: 10.1101/2020.06.02.129775
sha: bbcfc373e6f7e42d398a5ccd7fae712ea6a4b2f7
doc_id: 962394
cord_uid: st6s7l0b

The worldwide spread of the SARS-Cov-2 virus has caused more than 250,000 deaths and over 4 million confirmed cases. The severity of Covid-19, the exponential rate at which the virus proliferates, and the rapid exhaustion of public health resources are critical factors. RT-PCR with virus DNA identification is still the benchmark Covid-19 diagnosis method. In this work we propose a new technique for representing DNA sequences: they are divided into smaller sequences with overlap in a pseudo-convolutional approach, and represented by co-occurrence matrices. This technique analyzes the DNA sequences obtained by the RT-PCR method, eliminating sequence alignment. Through the proposed method, it is possible to identify virus sequences from a large database: 347,363 virus DNA sequences from 24 virus families and SARS-Cov-2. Experiments with all 24 virus families and SARS-Cov-2 (multi-class scenario) resulted in 0.822222 ± 0.05613 for sensitivity and 0.99974 ± 0.00001 for specificity using Random Forests with 100 trees and 30% overlap. When we compared SARS-Cov-2 with virus families that cause similar symptoms, we obtained 0.97059 ± 0.03387 for sensitivity and 0.99187 ± 0.00046 for specificity with the MLP classifier and 30% overlap. In the real test scenario, in which SARS-Cov-2 is compared to Coronaviridae and healthy human DNA sequences, we obtained 0.98824 ± 0.01198 for sensitivity and 0.99860 ± 0.00020 for specificity with MLP and 50% overlap. Therefore, the molecular diagnosis of Covid-19 can be optimized by combining RT-PCR and our pseudo-convolutional method to identify SARS-Cov-2 DNA sequences faster with higher specificity and sensitivity.

At the end of 2019, the SARS-Cov-2 virus emerged in the city of Wuhan, China (Zhou et al., 2020). Within a few months, there were more than 250,000 deaths worldwide and over 4 million confirmed cases (WHO, 2020b). Covid-19, as it became known, is a respiratory syndrome. In moderate cases, it manifests clinically as pneumonia. In critical cases, the disease can lead to respiratory failure, septic shock, and/or multiple organ dysfunction (MOD) or failure (MOF) (Cascella et al., 2020; Peeri et al., 2020; Wang et al., 2020a).

Besides the severity of the disease, the exponential rate at which the virus proliferates is an aggravating factor. The transmission of the virus often occurs through asymptomatic people. Contagion happens through droplets or secretions from sneezing or coughing (Cascella et al., 2020). Because of this, many countries have been experiencing overcrowding in their hospital centers. Most medical professionals are working long hours, and the number of pulmonary ventilators is not enough for all patients. This scenario has led dozens of countries to adopt measures of social isolation. They attempt to contain the dissemination and to reduce the number of people who need hospitalization (Hellewell et al., 2020; Wilder-Smith & Freedman, 2020; Kraemer et al., 2020).

In response to this growing pandemic, several companies and research centers worldwide have researched and developed methods for diagnosing Covid-19 (Wang et al., 2020b). Among them, rapid tests emerged, which can provide results in about 30 minutes. One type of rapid test is the Rapid Diagnostic Test (RDT). Using samples from the patient's respiratory tract, the RDT seeks to detect the presence of antigens. Antigens are substances that are foreign to the body, causing immune responses. These responses produce specific antibodies, capable of binding to and interacting with the antigen, ensuring the protection of the organism. Thus, in tests of the RDT type, antibodies are fixed on paper tapes and placed in plastic capsules, similar to the well-known pregnancy tests.

If the target antigen is present in the patient's sample at certain concentrations, it will attach to the antibodies on the tape, generating a visual signal. Unfortunately, this method has some restrictions. First, detection is only possible in the acute stages of infection, when antigens are expressed. In addition, efficiency depends on factors such as quality, the collection protocol, and the formulation of reagents. We must also emphasize the possibility of false positives, which occur when the antibodies present on the tape recognize antigens from other types of viruses. For these reasons, the sensitivity of the RDT can vary from 34 to 80% (Bruning et al., 2017; WHO, 2020a).

Another type of rapid test is based on host antibody detection. In this case, antibodies are detected in the patient's blood samples, depending on factors such as age, nutrition, disease severity and medications. However, recent studies have shown that the immune response is very weak, late or even absent in many patients with confirmed Covid-19 (Döhla et al., 2020; Patel et al., 2020; Burog et al., 2020; Li et al., 2020; Liu et al., 2020; Pan et al., 2020). This means that this type of detection is often only possible in recovered patients. One study reports 285 patients who tested positive for IgG. However, these immune responses were seen 19 days after the first symptoms. This condition makes testing ineffective in many situations, as opportunities for treatment and clinical interventions no longer exist. Therefore, WHO does not currently recommend these types of rapid diagnostic tests. The suggestion is to use them in research contexts or as a way of screening patients, or of potential diagnosis (WHO, 2020a). Therefore, the benchmark for Covid-19 diagnosis is molecular diagnosis, or RT-PCR with DNA sequencing and identification (Patel et al., 2020; Tahamtan & Ardebili, 2020). First, the reverse transcription step produces double-stranded DNA, which is a copy of the virus's RNA. Then, the PCR exponentially amplifies fragments of this DNA during successive cycles, generating millions of copies to be analyzed. Next, the cDNA is aligned with sequences from the SARS-Cov2 virus. Sequence alignment is a traditional method for analyzing similarity between sequences. Among the most consolidated methods are BLAST and FASTA. If there is a match between both sequences, then the patient is confirmed positive. Otherwise, the patient is considered negative for Covid-19 (Bosco & Di Gangi, 2016; Rizzo et al., 2015; Zhang & Harmon, 2020; Chan et al., 2020).

Although so far RT-PCR with DNA identification is considered the most accurate and effective method, there are still some weaknesses. A major limitation of the sequence alignment methods is their computational complexity and time consumption. In many cases, patients can wait days to receive the diagnosis due to sample preparation and genomic analysis. Because of this, several studies have proposed alignment-free methods for genomic sequence classification.

Most of these methodologies involve a feature extraction method such as spectral representation of DNA sequences. Thus, the representative attributes of the sequence can be combined with methods of artificial intelligence, especially machine learning. This makes it possible to assign each analyzed sequence to a class (Covid-19 positive or Covid-19 negative, for example) (Bosco & Di Gangi, 2016; Rizzo et al., 2015).

In this work we propose a new technique for representing sequences based on the analysis of the relationships between nitrogenous bases. This technique analyzes the DNA sequences obtained by the RT-PCR method, eliminating the alignment process. The idea is as follows: a DNA sequence is divided into n smaller sequences. Each subsequence i is superimposed with a part of the subsequence i−1 and with a part of the subsequence i+1, giving rise to two new subsequences. These smaller sequences are represented by co-occurrence matrices.

The matrices are square, with dimensions 4x4, where rows and columns correspond to the nitrogenous bases of DNA (Adenine, Cytosine, Thymine, and Guanine). The co-occurrence matrix considers the occurrence of each of the bases, as well as the relationship between bases and their immediate neighbors. Then, the co-occurrence matrices are stacked together, forming a volume. Considering that the sequences can be subdivided into smaller and smaller subsets, with the formation of new co-occurrence matrices, the proposed method has a pseudo-convolutional aspect from the algorithmic point of view.

After obtaining the set of matrices, they are then concatenated, forming attribute vectors. These extracted attributes correspond to a high-level vectorial representation of the initial DNA sequence, independent from the size of the sequence. This feature vector can be classified by machine learning techniques.

Through the proposed method, it is possible to identify virus sequences from a relatively large database. Several advantages can be pointed out with this approach. First, it is not necessary to pre-align the sequence under investigation in relation to the reference sequences. Second, the sequence under study is compared with a wide set of sequences of given classes, and not just with a reference sequence, strengthening the reliability of the test. We also emphasize that the method can be applied to sequences of any size.

The present work seeks to describe and test the new method of feature extraction to represent sequences of nitrogenous bases. Our main objective is to optimize the RT-PCR, the benchmark for Covid-19 diagnosis. To reach this goal, we used genomic sequences of different viruses obtained from the repository VIPR (Virus Pathogen Resource) (Pickett et al., 2012). We used 24 virus families with more than 500 sequences each, including the SARS-Cov2 family. Each sequence was submitted to the representation process described here. Next, we performed multiple experiments with different machine learning methods. (method) presented a superior performance, considering four metrics (accuracy, kappa index, sensitivity and specificity).

This work is organized as follows: in section 2 we present a brief overview of the state of the art of DNA methods; in section 3 we present our methodology, including our proposal, the description of the database, the experiment parameters and the metrics used to measure performance. In section 4 we provide our experimental results and analyze them; finally, in section 6 we summarize the scientific contribution of this work and discuss potential future work.

These values can be considered high, in comparison with results obtained in other studies (Cassaniti et al., 2020). The work also tested the performance of the method in 10 patients using peripheral blood. The results remained reliable. Thus, the work is promising and points out an interesting path for a PCR method.

In addition, they selected tests that can be performed quickly in an emergency context. The selected blood tests were complete blood count, creatinine, potassium, sodium, C-reactive protein, in addition to the patient's age.

Considering the imbalance of the database, the work used SMOTE (Synthetic Minority Oversampling Technique) (Chawla et al., 2002; Lusa et al., 2013), which is capable of generating synthetic data from the minority class.

Then they trained 10 support vector machines (SVM). The initial prediction model corresponds to the average probability of the 10 models developed. The testing and training processes were performed 100 times, using different subsets, with a percentage split of 90% for training and 10% for testing. All models and statistics were obtained using R. The authors achieved an average specificity of 85.98%, an average sensitivity of 70.25%, a negative predictive value (NPV) of 94.92%, and a positive predictive value (PPV) of 44.96%. For the last metric, the authors believe that severe cases, however negative for Covid-19, generated more confusion in the classification. Another study (Barbosa et al., 2020), using the same initial database, applied attribute extraction methods (Particle Swarm Optimization) to search for the best tests among the 108 initial ones. Then, the authors manually selected exams in order to reduce costs. The result was 24 selected exams, with performance similar to the initial base. The results of the evaluation metrics were: 95.16% average accuracy, sensitivity of 0.969, specificity of 0.936 and kappa index of 0.903. The authors made a desktop version of the system available for free non-commercial use.

While rapid diagnostic methods are important and optimize this process, the gold standard and recommendation of WHO is still the RT-PCR method with DNA sequencing (WHO, 2020a), similar to the method developed for the diagnosis of SARS-Cov (Chan et al., 2004; Emery et al., 2004; Corman et al., 2012). Accordingly, multiple studies and protocols for identifying SARS-Cov2 by molecular diagnosis have already been published (Corman et al., 2020b,a; Poon et al., 2020; Chu et al., 2020; Nao et al., 2020). Chu et al. (2020) developed RT-PCR assays to detect SARS-Cov2 in human clinical samples. The authors relied on the first publication of the virus sequence on Genbank, in addition to sequences of other types of coronavirus to perform the alignment. Thus, they designed two monoplex assays, which target the ORF1b and N gene regions.

Then, these primer and probe sequences were confirmed with other released SARS-Cov2 sequences. RT-PCR reactions were performed by a thermal cycler, with the N gene assay recommended as a screening assay and the ORF1b assay as a confirmatory one.

The biggest difficulty, however, is that RT-PCR is time-consuming and labour intensive, and consequently, its result can take days to be available (Ai et al., 2020). This makes clinical conduct difficult and favors the contamination of more people by SARS-Cov2. In this sense, the objective of this work is to propose an optimization of the gold standard method.

Our work considers genome sequences of several virus types, where each sequence is organized into a single matrix. Initially, the genome sequence is divided into n subsequences, each of which will then be overlapped with its neighbors.

In the overlapping process, a parameter received by the method determines the size of the superimposed pieces. Every subsequence i is combined with a piece of the subsequence immediately to its left, i − 1, and also with a piece of the one to its right, i + 1. An exception is made for the first and last subsequences, given that they have only one neighboring subsequence from which to take a piece. This procedure results in two new sequences for each of the subsequences generated from the original genome.
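The splitting and overlapping step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `split_with_overlap` is hypothetical, the piece borrowed from a neighbor is assumed to be the overlap fraction of that neighbor's length, and any remainder of the division is assumed to go to the last subsequence.

```python
def split_with_overlap(seq, n=4, overlap=0.3):
    """Split seq into n chunks, then extend each chunk with a piece of each neighbor.

    Assumptions (not stated in the text): the borrowed piece is `overlap`
    times the neighbor's length, and the last chunk absorbs any remainder.
    """
    if n < 1 or len(seq) < n:
        raise ValueError("need 1 <= n <= len(seq)")
    size = len(seq) // n
    bounds = [i * size for i in range(n)] + [len(seq)]
    chunks = [seq[bounds[i]:bounds[i + 1]] for i in range(n)]

    windows = []
    for i, chunk in enumerate(chunks):
        if i > 0:  # borrow the tail of the left neighbor
            left = chunks[i - 1]
            k = int(len(left) * overlap)
            windows.append(left[len(left) - k:] + chunk)
        if i < n - 1:  # borrow the head of the right neighbor
            right = chunks[i + 1]
            k = int(len(right) * overlap)
            windows.append(chunk + right[:k])
    return windows


windows = split_with_overlap("ACGTACGTACGTACGTACGT", n=4, overlap=0.5)
```

With this reading, the first and last subsequences contribute one window each and every inner subsequence contributes two, so n = 4 gives six windows.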

After that, these smaller sequences are represented by co-occurrence matrices. In general terms:
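The original expression is not recoverable from this text. One plausible formalization, assuming that the matrix counts each base together with its immediate right-hand neighbor and that the counts are not normalized (both are assumptions, as are the symbols used), is:

```latex
% s = s_1 s_2 \dots s_m is one subsequence over the alphabet {A, C, T, G}.
% M is its 4x4 co-occurrence matrix, indexed by pairs of bases (a, b).
M_{a,b} = \sum_{k=1}^{m-1} \mathbb{1}\left[ s_k = a \ \wedge\ s_{k+1} = b \right],
\qquad a, b \in \{A, C, T, G\}
```

Under this reading, the sum of all entries of M equals m − 1, the number of adjacent base pairs in the subsequence.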

After obtaining this set of matrices, they are then concatenated, forming attribute vectors. These extracted attributes correspond to a high-level vectorial representation of the initial DNA sequence, independent from its size.
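A minimal sketch of the concatenation step is given below. It assumes raw adjacent-pair counts and a row-by-row flattening; the names `cooccurrence` and `feature_vector` are hypothetical, and whether the authors normalize the counts is not stated in the text.

```python
BASES = "ACTG"  # row/column order: Adenine, Cytosine, Thymine, Guanine
INDEX = {base: i for i, base in enumerate(BASES)}


def cooccurrence(seq):
    """4x4 matrix counting how often each base is immediately followed by each base."""
    matrix = [[0] * 4 for _ in range(4)]
    for left, right in zip(seq, seq[1:]):
        if left in INDEX and right in INDEX:  # skip ambiguous symbols such as N
            matrix[INDEX[left]][INDEX[right]] += 1
    return matrix


def feature_vector(windows):
    """Flatten one co-occurrence matrix per window and concatenate them."""
    vector = []
    for window in windows:
        for row in cooccurrence(window):
            vector.extend(row)
    return vector


short = feature_vector(["ACGT", "GTAC"])
longer = feature_vector(["ACGTACGTAC", "GTACGTACGTACGT"])
```

Both vectors have 2 × 16 = 32 entries even though the input windows differ in length, which illustrates why the representation does not depend on the size of the sequence.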

This process is illustrated in Figure 1.

In order to verify the proposed method's efficiency in extracting characteristics from genome, different classifiers will process the data. The following classifiers were selected because they are widely used in machine learning.

The Random Forest classifier uses decision trees as its building blocks (Ho, 1995). Decision trees, as illustrated in Figure 2, iteratively separate data by testing one property at a time, with the resulting leaves representing the most specific categories and the root representing the raw data. The Random Forest is built from many such trees, each of which makes its own class prediction for any given input. The class with the most votes is the Random Forest's output. As the characteristics that divide the evaluated genomes are not known, this method is advantageous because it verifies many possibly relevant properties.

Thus, it can test and locate differences in the genetic code in question.
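The voting step can be illustrated with a small sketch. The three one-split "trees" below are hypothetical stand-ins for trained decision trees, and the feature indices and cut-offs are invented for the example.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class that receives the most votes from the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical one-split trees, each testing a single feature of x.
trees = [
    lambda x: "virus_a" if x[0] > 5 else "virus_b",
    lambda x: "virus_a" if x[1] > 2 else "virus_b",
    lambda x: "virus_a" if x[2] > 7 else "virus_b",
]

prediction = forest_predict(trees, [6, 3, 1])
```

Here two of the three trees vote for `virus_a`, so that class is returned even though the third tree disagrees.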

The Naive Bayes classifier uses probability, specifically the Bayes theorem (Maron, 1961). The Bayes theorem defines the probability of an event A happening, given that another event B has already taken place. It can be expressed as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

It is called naive because it assumes independence in the features that lead to the events. Furthermore, it assumes all predictors have an equal weight.

This approach is beneficial because it explores the possibility that the genomes have dividing properties that are not correlated. Should that be the case, this classifier might yield good results.
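As a worked illustration of the independence assumption, the sketch below multiplies a class prior by per-feature likelihoods and normalizes the result. All probabilities are made-up numbers, and `naive_bayes_posterior` is a hypothetical helper rather than the classifier used in the experiments.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """priors: {cls: P(cls)}; likelihoods: {cls: {feature: P(feature | cls)}}."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[cls][feature]  # independence assumption
        scores[cls] = score
    evidence = sum(scores.values())  # plays the role of P(B)
    return {cls: score / evidence for cls, score in scores.items()}


priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"f1": 0.8, "f2": 0.5},
    "neg": {"f1": 0.2, "f2": 0.5},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["f1", "f2"])
```

The unnormalized scores are 0.5 × 0.8 × 0.5 = 0.2 for `pos` and 0.5 × 0.2 × 0.5 = 0.05 for `neg`, so the posterior for `pos` is 0.2 / 0.25 = 0.8.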

The k-nearest neighbors algorithm, also known as IBK (Altman, 1992), does not construct a model. Instead, it predicts by measuring the distance between a test sample and the samples in the training set and selecting the k closest ones. The selected training instances generate the prediction, as demonstrated in Figure 3. It could prove successful because it classifies by finding similar instances. Thus, it might be able to identify genome sequences that belong to the same virus.
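A minimal sketch of this prediction rule, assuming Euclidean distance and a simple majority vote (the text does not state which distance or tie-breaking rule was used), is:

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label). Majority label among the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy training set with two well-separated groups.
train = [
    ([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
    ([5, 5], "b"), ([5, 6], "b"),
]
label = knn_predict(train, [0.2, 0.2], k=3)
```

No model is fitted: every prediction scans the stored training instances, which is why the cost grows with the size of the training set.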

The Multilayer Perceptron (MLP), shown in Figure 4, is a neural network capable of solving non-linear problems (Minsky & Papert, 1969). Each neuron has weights that multiply its inputs, and the weighted sum is in turn processed by an activation function to generate the output. The weights are adjusted until the network reaches a certain accuracy in its output. In this manner, it could identify the features that are particular to each class.
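The forward computation of a neuron and of a small network can be sketched as below. The sigmoid activation and the single hidden layer are assumptions made for the example, and the weight-adjustment (training) step is not shown.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(inputs, hidden_layer, output_layer):
    """Each layer is a list of (weights, bias) pairs, one pair per neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return [neuron(hidden, w, b) for w, b in output_layer]


# Two inputs, two hidden neurons, one output neuron; the weights are arbitrary.
hidden_layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output_layer = [([1.0, -1.0], 0.0)]
outputs = mlp_forward([1.0, 1.0], hidden_layer, output_layer)
```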

The Support Vector Machine (SVM) (Cortes & Vapnik, 1995) seeks an optimal hyperplane that can separate the data into classes, as exemplified in Figure 5. The hyperplane is defined in a space of n dimensions, where n is the number of features. The support vectors are the samples closest to the dividing hyperplane, which aid in its construction.

Thus, it could be used to classify the genomes by dividing them with such a hyperplane.

The second dataset used in this paper is from the Genome Reference Consortium (Consortium, 2013). Its purpose was to represent the human genome, and it has 103,959 samples.

Various experiments were constructed to evaluate the quality of the feature extraction method. They aim to simulate different use cases in which SARS-CoV2 might need to be identified. There is a multiclass experiment, a binary classification, a classification of viruses with similar symptoms, and a real test scenario.

This experiment's purpose is to differentiate SARS-CoV2 and the other viruses listed in Table 1 from each other. In it, all 25 classes of Table 1 were used to build the database, which was split into a training set and a test set. In classes with more than 500 instances, the training set consisted of 500 of them, and the rest were used in testing. The classes with fewer than 500 samples had 70% of their samples allocated for training and 30% for testing. Additionally, the feature extraction hyperparameter n was set to 4, and overlap was tested at 30%, 50%, and 70%.
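The per-class splitting rule described above can be sketched as follows. The function name, the random shuffling, the seed, and the rounding of the 30% test share (floor) are assumptions; the text does not say how samples were selected or how a class with exactly 500 samples was handled.

```python
import random


def split_class(samples, cap=500, test_percent=30, seed=0):
    """Return (train, test) for one class.

    More than `cap` samples: `cap` go to training and the rest to testing.
    Otherwise: `test_percent` percent (rounded down) go to testing.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed; the selection rule is not stated
    if len(shuffled) > cap:
        n_train = cap
    else:
        n_train = len(shuffled) - len(shuffled) * test_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


big_train, big_test = split_class(range(1200))
small_train, small_test = split_class(range(42))
```

With floor rounding, a class of 42 samples yields 30 training and 12 test samples.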

This test was utilized to analyze the proposed method's efficiency in differ- Overlap was set to 30%, 50%, and 70%.

This test included three classes: the human genome, from the Genome Reference Consortium; SARS-Cov2; and Coronaviridae. Training and test splitting was performed as previously established, and the value of n remained the same. Furthermore, the overlap was also tested at 30%, 50%, and 70%.

• Confusion Matrix

The confusion matrix provides a more straightforward structure for the portrayal of the model's output, wherein the rows represent its predictions, On the other hand, the number of misclassified instances is obtained from the opposite diagonal.

• Accuracy

The accuracy describes the rate of correct classification of instances and is the most commonly used metric in machine learning. Considering a confusion matrix $T = [t_{i,j}]_{n \times n}$ for a classification task with n classes, in which i denotes the index of the true class and j denotes the index of the class chosen by the classifier, the accuracy is defined as follows:

$$\text{Accuracy} = \frac{\sum_{i=1}^{n} t_{i,i}}{\sum_{i=1}^{n} \sum_{j=1}^{n} t_{i,j}} \tag{3}$$

• Kappa Coefficient

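The Kappa Coefficient (κ) assesses the agreement between the predicted and the true classes, discounting the agreement expected by chance. It is defined as:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement (the accuracy) and $p_e$ is the agreement expected by chance, computed from the row and column totals of the confusion matrix $T = [t_{i,j}]_{n \times n}$:

$$p_e = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{n} t_{i,j} \right) \left( \sum_{j=1}^{n} t_{j,i} \right)}{\left( \sum_{i=1}^{n} \sum_{j=1}^{n} t_{i,j} \right)^{2}}$$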

• Precision

Precision indicates the proportion of positive classifications that are correct, and is calculated as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

where TP is the number of true positives and FP is the number of false positives.

• Recall

Recall measures the proportion of actual positives correctly classified by the model. It is computed by:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where FN is the number of false negatives.

• Sensitivity

The sensitivity, or True Positive Rate (TPR), coincides with recall and is given by:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

• Specificity

The specificity, or True Negative Rate (TNR), is defined as follows:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

where TN is the number of true negatives.

• Area Under the ROC Curve 

Thus, the Area Under the ROC Curve (AUC) measures performance across all possible classification thresholds of a given model, and therefore it portrays the quality of the results independently of the threshold chosen.

[...] is less sensitive to the highly imbalanced test dataset, it is a better evaluation [...] above 0.99 on weighted average specificity, so the Random Forest is presented as a robust classifier for this task.
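The confusion-matrix metrics described in this section can be computed as in the sketch below. It assumes that rows index the true class and columns the predicted class, uses the standard definition of Cohen's kappa, and obtains per-class sensitivity and specificity in a one-vs-rest fashion; the function names are hypothetical.

```python
def accuracy(t):
    """Sum of the main diagonal divided by the total number of instances."""
    total = sum(sum(row) for row in t)
    return sum(t[i][i] for i in range(len(t))) / total


def kappa(t):
    """Cohen's kappa: agreement beyond what row/column totals predict by chance."""
    n = len(t)
    total = sum(sum(row) for row in t)
    p_o = accuracy(t)
    p_e = sum(sum(t[i]) * sum(t[j][i] for j in range(n)) for i in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


def sensitivity_specificity(t, c):
    """One-vs-rest sensitivity and specificity for class index c."""
    n = len(t)
    total = sum(sum(row) for row in t)
    tp = t[c][c]
    fn = sum(t[c]) - tp
    fp = sum(t[j][c] for j in range(n)) - tp
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


t = [[40, 10],
     [5, 45]]
sens, spec = sensitivity_specificity(t, 0)
```

For this toy matrix the accuracy is 85/100 = 0.85, the chance agreement is (50 × 45 + 50 × 55) / 100² = 0.5, and kappa is therefore 0.35 / 0.5 = 0.7.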

Aiming to evaluate the overlap percentage in the feature extraction method, Figure 9 shows box plots for accuracy, Kappa statistic, weighted average precision, recall and ROC area for the Random Forest classifier in the datasets with 30%, 50%, and 70% overlap percentages. The variance of accuracy and kappa in the dataset with 30% overlap is higher than in the 50% and 70% overlap datasets. However, 30% overlap seems to be slightly better than (or at least at the same level as) the other overlap percentages.

Because of class imbalance in the test dataset, we need to evaluate sensitivity, specificity, and ROC area for each class individually. Considering the Random Forest classifier in the dataset with 30% overlap, Table 3 shows the results of sensitivity, specificity, and ROC area individually for each virus in the database. Specificity and ROC Area results are above 0.9 for every virus.

The sensitivity varies from 0.99391 for Pneumoviridae to 0.23397 for Filoviridae. However, for most of the classes, sensitivity has values greater than 0.8 (including the SARS-Cov2 class, with a sensitivity of 0.82).

In order to perform a visual analysis of these results, Figure 10 shows the average confusion matrix for the Random Forest classifier in the dataset with 30% overlap. The confusion matrix is expressed in terms of percentage for the particular class, and the class index numbers are the same as shown in Table 3. We can see that for some classes, there is confusion with another virus. One such case is SARS-Cov2 and Coronaviridae, since 11% of SARS-Cov2 samples are misclassified as Coronaviridae.

Since the ROC area for SARS-Cov2 is 0.99883 (Table 3) , we performed a threshold adjustment for SARS-Cov2 class in order to reach 100% sensitivity.
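One simple way to carry out such a threshold adjustment is sketched below. It assumes the classifier outputs a score for the target class and picks the largest threshold that still keeps every known positive above it; the text does not describe how the authors chose their threshold, and the scores here are invented.

```python
def threshold_for_full_sensitivity(scores, labels):
    """Largest threshold at which every positive sample is still predicted positive."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("at least one positive sample is required")
    return min(positives)


def sensitivity_specificity(scores, labels, threshold):
    """Rates obtained when samples with score >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]  # hypothetical target-class scores
labels = [1, 1, 1, 0, 0, 0]               # 1 = target class, 0 = any other class
thr = threshold_for_full_sensitivity(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, thr)
```

Lowering the threshold from 0.5 to 0.35 raises sensitivity from 2/3 to 1 in this toy case while specificity stays at 2/3; in general, full sensitivity is bought with additional false positives, and the threshold would normally be chosen on held-out data.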

The new average confusion matrix is shown in Figure 11. The highest false positive rate for SARS-Cov2 still comes from Coronaviridae (5.1%, index 17). Next in order of false positive rate are Hepatitis C virus (3.47%, index 20), Reoviridae (3.19%, index 22), and Phasmaviridae (2.68%, index 23).

Given that, in the multiclass scenario, the highest false positives for SARS- other metrics, MLP seems to be a more robust classifier for this particular task. Table 4 shows the sensitivity, specificity and ROC Area for each class. It is possible to notice that each one of those metrics has values above 0.96. Figure 13 shows 

In this experiment, viruses were selected because they cause similar symptoms. The dataset was arranged into four classes: SARS-Cov2, Coronaviridae, Paramyxoviridae, and Miscellaneous. The Miscellaneous class comprises Pneumoviridae, Hantaviridae, Enterovirus, and Nairoviridae. Then, the same classifiers used previously were evaluated in this classification task.

Figures 14 and 15 show the accuracy and kappa for all classifiers and datasets in this classification task. Except for the Naive Bayes classifier, the classifiers have similar performance metrics, with approximately 97% accuracy and kappa equal to 0.96. Figure 16 shows the weighted average specificity, sensitivity and ROC area. The weighted average sensitivity and specificity look very similar for all classifiers (except the Naive Bayes classifier). However, the weighted average ROC area for the MLP and Random Forest classifiers is slightly higher than for the other classifiers, although the IBK and SVM classifiers also achieve a weighted average ROC area above 0.98 in all datasets.

In order to better evaluate the MLP and Random Forest classifiers, Figure 17 shows the confusion matrices for those classifiers in all datasets. The Random Forest presents a confusion between SARS-Cov2 and Coronaviridae of approximately 10%. This is very similar to the results achieved in the multiclass scenario.

However, the MLP classifier shows much lower confusion between SARS-Cov2 and Coronaviridae (1.57% in the datasets with 30% and 50% overlap). The main confusion found with the MLP classifier is Coronaviridae classified as SARS-Cov2 (3.81% for the dataset with 30% overlap). From the MLP confusion matrices, it is not possible to find significant differences between the 30%, 50%, and 70% overlap percentages. Since the 30% overlap requires less computational effort to extract the features, we can select the MLP classifier with the 30% overlap dataset as the better approach to this particular task. Table 5 shows the sensitivity, specificity and ROC area for each class. The average ROC Area and specificity are above 0.99 for all classes. The average sensitivity is also above 0.99 for the Paramyxoviridae and Miscellaneous classes. The lowest sensitivity is for Coronaviridae (0.959), while a slightly higher sensitivity is achieved for SARS-Cov2 (0.97).

In this scenario, the SARS-Cov2 test is designed as a three-class classification problem: SARS-Cov2 (the test target), GRCh38 (the healthy human reference), and Coronaviridae (a virus control sample). The same classifiers used in the other experiments were applied to this new task. Figure 18 shows the accuracy and Figure 19 shows the kappa statistic results.

Except for the Naive Bayes Classifier, all other classifiers have accuracy above 99% and kappa above 0.9. By these metrics, it is not possible to distinguish the best classifier. The same behavior is observed in the weighted average metrics shown in Figure 20. Weighted average sensitivity, specificity, and ROC area are higher than 0.99 for all classifiers except the Naive Bayes Classifier. Table 6 shows the sensitivity, specificity and ROC Area for each of the classes for the MLP classifier.

Regarding the feature extraction method, it seems to capture the structure of the viruses' genome sequences. The Random Forest classifier achieved the best overall performance for multiclass scenarios, while the MLP classifier presented the best results for scenarios with fewer classes.

Evaluating the parameters of the proposed feature extraction method, splitting the viruses' genome sequence into four parts (n = 4) seems to be enough to produce representative features. Regarding the overlap percentage, the proposed feature extraction method is not very sensitive to this parameter, and 30% to 50% seems to be enough to produce good feature representations.

The first multiclass scenario (with 25 virus classes) is an extreme case. Nevertheless, the Random Forest classifier achieved sensitivity and specificity above 0.9 for many classes. For those classes with lower sensitivity, the confusion matrix shows that most confusions occur between two specific viruses. For example, Filoviridae is the class with the lowest sensitivity rate (0.23). However, checking the confusion matrix, on average, 76.27% of Filoviridae samples are misclassified as Ebola Virus. There is no other significant confusion for Filoviridae, so it is possible to design a classifier cascade to solve this specific confusion between two viruses.
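A two-stage cascade of this kind could look like the sketch below. It is not part of the reported experiments: the general and specialist models are hypothetical callables, and the class labels, feature indices and cut-offs are placeholders.

```python
def cascade_predict(x, general, specialist, confusable):
    """Use `specialist` only when `general` predicts one of the `confusable` labels."""
    label = general(x)
    return specialist(x) if label in confusable else label


# Hypothetical stand-ins for trained models.
general = lambda x: "class_b" if x[0] > 0.5 else "class_c"     # tends to merge a into b
specialist = lambda x: "class_a" if x[1] > 0.5 else "class_b"  # trained on a vs b only
pair = {"class_a", "class_b"}

result = cascade_predict([0.9, 0.8], general, specialist, pair)
```

The second stage only sees samples that the first stage assigned to the confusable pair, so it can be trained on those two classes alone.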

One particular virus class is Phasmaviridae, since it has only 42 samples in the dataset (30 used for training and 12 for testing). Even with this small number of samples in the multiclass scenario, the significant misclassifications for Phasmaviridae are limited to Hantaviridae (22.78%) and Peribunyaviridae (35.26%).

With a larger sample size for Phasmaviridae, classifiers could find a better decision boundary, reducing this false-negative rate. However, for this particular class, three-class cascade classifiers could be evaluated to deal with these types of errors.

Regardless of the feature extraction parameters or even the classifier used, 3-4% of Coronaviridae samples are still misclassified as SARS-Cov2. However, this is an expected outcome, since SARS-Cov2 belongs to the Coronaviridae family. Visualizing the extracted features, we found some samples of SARS-Cov2 and Coronaviridae that cannot be distinguished, as shown in Figure 22.

So, it is tough for any classifier to separate those two classes optimally. 

In this work we presented a novel method to represent DNA sequences by using pseudo-convolutions and co-occurrence matrices. With this method, we were able to represent hundreds of thousands of DNA sequences from 24 virus families. Then we separated SARS-Cov-2 sequences from the Coronaviridae family and demonstrated that our model is able to differentiate all virus families present in our database. SARS-Cov-2 was discriminated from virus families other than Coronaviridae and even from other coronaviruses with very high sensitivity and specificity.

We aimed to show the potential of optimizing the molecular diagnosis of Covid-19 by combining RT-PCR, the current ground-truth Covid-19 diagnostic method, and our pseudo-convolutional method to identify SARS-Cov-2 DNA sequences faster.

From the obtained results, we can assume that the proposed pseudo-convolutional approach is able to characterize SARS-Cov-2 DNA sequences. This new representation of DNA sequences can be successfully used as a feature extraction stage for fully connected networks, in keeping with the deep learning philosophy, or for other classical classification architectures. The evaluation of the proposed approach in real test scenarios, necessarily reduced to a limited set of virus families and healthy human sample DNA, showed high sensitivity (higher than 0.988) and specificity (higher than 0.998) as well. Hence, other researchers can use our solution and our methods to improve their results and diagnose Covid-19 faster, with accuracies even higher than the state-of-the-art methods.

Era of molecular diagnosis for pathogen identification of unexplained pneumonia, lessons to be learned

An introduction to kernel and nearest-neighbor nonparametric regression

Extracting possibly representative COVID-19 biomarkers from x-ray images with deep learning approach and image data related to pulmonary diseases

Covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks

Heg.ia: An intelligent system to support diagnosis of covid-19 based on blood tests

Deep learning architectures for dna sequence classification

Rapid tests for influenza, respiratory syncytial virus, and other respiratory viruses: a systematic review and metaanalysis

Should IgM/IgG rapid test kit be used in the 660 diagnosis of COVID-19? Asia Pacific Center for Evidence Based Healthcare

Features, evaluation and treatment coronavirus (covid-19). In StatPearls

Performance of vivadiag COVID-19 IgM/IgG rapid test is inadequate for diagnosis of COVID-19 in acute patients referring to emergency room department

Improved molecular diagnosis of COVID-19 by the novel, highly sensitive and specific COVID-19-RdRp/Hel real-time reverse transcription-PCR assay validated in vitro and with clinical specimens

Laboratory diagnosis of sars

SMOTE: synthetic minority over-sampling technique

Molecular diagnosis of a novel coronavirus (2019-nCoV) causing an outbreak of pneumonia

Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes

Genome Reference Consortium Human Build 38 . grc

Diagnostic detection of 2019-nCoV by real-time RT-PCR. World Health Or-695 ganization

Detection of a novel human coronavirus by real-time reverse-transcription polymerase chain reaction

Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR

Support-vector networks

Rapid point-of-care testing for SARS-CoV-2 in a community screening setting shows low sensitivity

Real-time reverse transcription-polymerase chain reaction assay for SARS-associated coronavirus

Ikonos: An intelligent tool to support diagnosis of covid-19 by texture analysis of x-ray images. medRxiv

Fea-720 sibility of controlling COVID-19 outbreaks by isolation of cases and contacts. The Lancet Global Health

The effect of human mobility and control measures on the COVID-19 epidemic in China

Development and clinical application of a rapid IgM-IgG combined antibody test for SARS-CoV-2 infection diagnosis

Diagnostic indexes of a rapid IgG/IgM combined antibody test for SARS-CoV-2. medRxiv

Antibody responses

CoV-2 in patients with COVID-19

Smote for high-dimensional class-imbalanced data

Automatic indexing: An experimental inquiry

Perceptrons: An Introduction to Computational Geometry

Detection of second case of 2019-ncov infection in japan

Automatic detection of coronavirus 745 disease (COVID-19) using x-ray images and deep convolutional neural networks

Serological immunochromatographic approach in diagnosis with SARS-CoV-2 infected COVID-19 patients

Report from the american society for microbiology COVID-19 international summit

The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned?

ViPR: an open bioinformatics database and analysis resource for virology research

Detection of 2019 novel coronavirus (2019-nCoV) in suspected human cases by RT-PCR. School of Public Health

A deep learning ap-770 proach to DNA sequence classification

Detection of coronavirus disease (Covid-19) based on deep features

A novel specific artificial intelligence-based method to identify COVID-19 cases using simple blood exams. medRxiv

Real-time RT-PCR in COVID-19 detec-780 tion: issues affecting the results

Random decision forests

Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China

Detection of SARS-CoV-2 in different types of clinical specimens

Advice on the use of point-of-care immunodiagnostic tests for COVID-19 . World Health Organization. URL: www.who.int/news-room/commentaries/detail/ advice-on-the-use-of-point-of-care-immunodiagnostic-tests-for-covid-19 795 last accessed

WHO Coronavirus Disease (Covid-19) Dashboard. World Health Organization

Isolation, quarantine, social distancing and community containment: pivotal role for old-style public health 800 measures in the novel coronavirus (2019-nCoV) outbreak

RNA extraction from swine samples and detection of influenza a virus in swine by real-time RT-PCR

Evaluation of recombinant nucleocapsid and spike proteins for serological diagnosis of novel coronavirus disease

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in wuhan, china: a retrospective cohort study. The Lancet

We are grateful to the Brazilian research-funding agency CNPq, for the partial support of this research.

All authors declare they have no conflicts of interest.

This study was partially funded by the Brazilian research agency Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq.

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.