key: cord-102766-n6mpdhyu authors: Alam, Md. Nafis Ul; Chowdhury, Umar Faruq title: Short k-mer Abundance Profiles Yield Robust Machine Learning Features and Accurate Classifiers for RNA Viruses date: 2020-06-25 journal: bioRxiv DOI: 10.1101/2020.06.25.170779 sha: doc_id: 102766 cord_uid: n6mpdhyu High throughout sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task but these models rely entirely on DNA genomes and owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. On de-novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lays the groundwork for robust machine learning methods for the automated annotation of sequence data. Author Summary In this age of high-throughput sequencing, proper classification of copious amounts of sequence data remains to be a daunting challenge. Presently, sequence alignment methods are immediately assigned to the task. Owing to the selection forces of nature, there is considerable homology even between the sequences of different species which draws ambiguity to the results of alignment-based searches. Machine Learning methods are becoming more reliable for characterizing sequence data, but virus genomes are more variable than all forms of life and viruses with RNA-based genomes have gone overlooked in previous machine learning attempts. We designed a novel short k-mer based scoring criteria whereby a large number of highly robust numerical feature sets can be derived from sequence data. These features were able to accurately distinguish virus RNA from human transcripts with performance scores better than all previous reports. Our models were able to generalize well to distant species of viruses and mouse transcripts. The model correctly classifies the majority of false hits generated by current standard alignment tools. These findings strongly imply that this k-mer score based computational pipeline forges a highly informative, rich set of numerical machine learning features and similar pipelines can greatly advance the field of computational biology. of viruses can be radically different from all life, both in terms of molecular structure and 23 primary sequence. Alignment-based and profile-based searches are commonly employed for 24 characterization of assembled viral contigs from high-throughput sequencing data. Recent 25 attempts have highlighted the use of machine learning models for the task but these models rely 26 entirely on DNA genomes and owing to the intrinsic genomic complexity of viruses, RNA 27 viruses have gone completely overlooked. Here, we present a novel short k-mer based sequence 28 scoring method that generates robust sequence information for training machine learning 29 classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human 30 transcripts. We challenged our models with very stringent testing protocols across different 31 species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For 32 clean sequence data retrieved from curated databases, our models display near perfect accuracy, 33 outperforming all similar attempts previously reported. On de-novo assemblies of raw RNA-Seq 34 data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 35 depending on the software used for assembly. Our classifier was able to properly classify the 36 majority of the false hits generated by BLAST and HMMER3 searches on the same data. The 37 outstanding performance metrics of our model lays the groundwork for robust machine learning 38 methods for the automated annotation of sequence data. Viruses are numerous [1] and only a handful have been thoroughly characterized thus far [2] . As 62 we have entered and slowly progress through the age of automated, high-throughput genomics, 63 generation of sequence data itself is no longer of much concern, but rather accurate annotation of The bacterial 16sRNA gene tremendously aids the study of bacterial phylogeny [4] . The lack of 73 an omnipresent gene or genome segment that can be ubiquitously used to extract 74 phylogenetically relevant data about viruses, complicates the study of virus evolution [5, 6] . 75 There are additional levels to the structural complexity of viral genotypes. Viral genomes stand 76 out from those of all other genomic entities as they can be either single, double or gapped DNA 77 or RNA molecules with positive, negative or ambi-sense mechanism of genomic encoding; 78 whereas all known realms of life exclusively rely on DNA as the genetic material. These 79 categories of classification are built on top of the original Baltimore scheme[7] that sought to 80 characterize viral genomes with regard to expression of genetic information [8, 9] . 81 In the search for viruses in metagenomic data, researchers have noted that the usefulness of 82 homology based search tools are quickly exhausted [6, 10] . Most assembled contigs, likely to be 83 from viral origins, are short and fragmented with no guarantee of containing coding regions that 84 some of these tools rely on [11] . A number of software programs have been developed for the 85 purpose of identifying viral sequences that have integrated into host genomes [12] [13] [14] [15] . All 86 machine learning models built for identification and characterization of viral sequences from cell 87 samples or metagenomic data thus far rely on DNA sequences [16, 17] . RNA viruses comprise a 88 major group having great clinical and scientific importance [18] . Presently made even more 89 evident by the global pandemic caused by the 2019 novel SARS Coronavirus [19] . It has been 90 demonstrated that RNA-Seq data can be a very promising avenue for improving knowledge on 91 RNA viruses when leveraged by tactful algorithms [20] . Here, we present a novel short k-mer 92 based scoring function that can be applied to sequences of different types to achieve impressive 93 discriminatory capacity from classic machine learning models. We constructed a number of 94 classifiers that rely on the profile of short genetic elements across the genome to classify RNA 95 viruses from cellular mRNA and further classify them into positive or negative strand RNA 96 viruses. We tested the models rigorously using stringent cutoff parameters and many variable test 97 sets including noisy de-novo assembly data of cells cultured with Ebola virus. We evaluated the 98 performance of our selected model against nucleotide blast, protein blast and protein family 99 hmmer3 search. In spite of the stiff testing protocols, our k-mer based numerical features stood 100 out to be informative and robust datapoints for the classification of biological sequences by 101 machine learning algorithms. A total of 5460 features were contrived using the sequence scoring formula for k-mer elements 105 of length 1 to 6. The feature columns were tested for data normality by several methods. Q-Q 106 plots of the data columns were suggestive of a normal distribution (not shown but can be found 107 on git repository), but more robust statistical normality tests such as the Kolmogorov-Smirnov 108 test and Shapiro-Wilk test strongly delineated a non-normal distribution for all feature columns. Our features were therefore not suitable for univariate statistical analysis for feature selection. Scikit-learn's in-built tree-based feature importance values derived from our trained random 111 forest classifier were applied for feature selection on two levels. Similarly, scikit-learn's 112 optimized Lasso model applied to our trained logistic regression classifier with L1 penalty was 113 also used for feature selection on two succeeding levels. Using a combination of both our tree-114 based selections and Lasso, we obtained a total of 6 levels of feature importance categories (4 115 from the trees and Lasso themselves and 2 more for each method leading into the other on a 116 second level), each assigning a value of 1 to selected features and 0 to omitted ones (Fig 1) . The 117 top 3 feature sets ranked by importance from these 6 combined levels of selection are presented 118 in 5 CAATC, TCCAAT, GCAATC, TTGAC, CGTAGT, ACGGTT, TTGA, GTTGGT, CATCAA, ATACGG, ATCGTA, CGATAA, CGATTA, GTTGAT, CGAT, CGATC, CGGTT, GCGTAC, CGCGTT, TTGCG, CGTCGA, CAATCT, ATCAAC, ATTGG, ATAGCG, GATC, TTGCGC, GGTTG, TCGGTT, GTCAAT, ATCAAT, TTCGAC, CCGTTA, CCGATA, TCGA, GCGTTA 4 ATCATA, ACAATC, CGTA, TCGT, ATGGTA, GGTAGT, TGGT, CTGT, GTCGA, AGCTG, GTCAA, AAATAA, TCCTG, CAGC, TATCCG, CTGGA, CAATT, TCGGTA, AGGAA, ATCAA, ATAACG, TCCAAC, AATTGG, AAATG, TGGTTC, AACC, TTGATG, GGCGTA, TGGTT, CGTTGC, GTCGAC, TCAAC, CAGA, TCAATT, CAGAG, TCAAT, AGGTTG, TCAACC, CGCGTA, ATCCAA, GGTACG, GTCATA, ATTGGT, GCATAC, GGTATG, GATTGG, GTACGC, AACCGA, TCGCAA, TTTTTA, ACATCG, GCGTAA, TTTTTT, ACCGGT, GTTGAG, TAACAC, ATAGGG, CGCAA, CCGCAA, CGGTTG, TATGGT, AACGGT, TAGGGT, CTAACG, CGGTA, GTTGT, GGTTC, ACGATA, TTCGCA, CTCGAT, TCGAGT, AAAATG, TCCAT, GTCAAA, AATAAA, TTGGCG, GTTGC, GGTCAA, CTGTA, The summed feature score is obtained by summing the total number of times the feature appears 128 in the six filtered feature sets we obtained from our selection algorithm. The macro-averaged f1-score for the model was 88% and the binary f1-score was 81.83%. Among the others, the decision tree models with 5460 and 194 features and the Naïve Bayes 146 model trained with 68 features were the only models to obtain binary f1-scores below 95%. All 147 other models displayed resounding accuracy on both their cross-validation and test sets. AUROC of 0.95 (Fig 4) . The f1-macro score was 0.89 and the f1-binary score was 0.83 (Table 2 198 Part 2). Performance of this magnitude on novel divergent data concerns one to be critical of the to stand out to be good discriminators in biological data [15, 23] , particularly because of their 202 propensity against overfitting. Our 68-feature random forest model displayed impressive 203 classification accuracy and an AUROC of 0.96 on this divergent test set (Fig 4) . correctly classified 101 of these false positives giving us an accuracy of 75% on the unary data. When feeding the entirety of the human cell line assembly into our classification pipeline we 232 achieved an accuracy of 63% for transcripts of all sizes and a much-improved value of 95% 233 when taking transcripts of the same length range as our training data (Table 3) owing to the minimized skew of the separate classes (Fig 5) . The classification metrics from all 251 instances are presented in Table 3 . (Table 3) . To further investigate the 261 despondence of the model, we assessed the performance of the 10 assembly software separately. This revealed that performance varied significantly between the different assembly software that 263 joined the contigs (Fig 6) . genomes. By plotting the frequency of proteins discovered against the protein length cutoff 286 threshold, it was evident that at a length of about 100 amino acids, non-genic random ORF 287 numbers significantly subside (Fig 7) . Using this 100-amino-acid cutoff for 6-frame in-silico (Table 4) . In the same manner, we sorted out 2346 uninformative vFams from the high performance vFams 308 vFam A available at http://derisilab.ucsf.edu/software/vFam/. Using hmmsearch on our 309 informative vFams against the de-novo assembly dataset yielded 26003 false hits. Our model 310 was able to correctly classify 14089 of these false hits as human transcripts as noted in Table 4 . scale the model, we went beyond tree-based methods and employed Lasso regularization. Feature selection by Lasso stems from linear regression models by penalizing the coefficients of 361 the variables. Using a combination of these selection methods on multiple levels, we were able to 362 shrink the model down into only few features while only minimally compromising performance. Overfitting is always a concern when applying ML models to biological data. The first 364 impressions on the high accuracy of the classifiers suggested a case of overfitting in the model. The bootstrap aggregating or bagging meta-algorithm employed in random forest classification is 366 particularly resistant to high variance and overfitting. Although decision trees employ the same 367 bagging principle, because of their larger depths, they are more prone to higher variance and 368 overfitting. Many random sampling events generate subsets of data known as bootstrap samples. Isolated decision trees are trained on these bootstrap samples to construct the random forest 370 ensemble. A majority vote from the members of the ensemble generates predicted classifications 371 for test sets. Since our random forest models performed noticeably better than the decision tree 372 model, it would lead us to believe that overfitting does not account for the fidelity of our models. and the prevalence of many viral genomic remnants in the genomes of host organisms. We can 431 also reasonably conclude that protein-based methods such as protein blast or pfams are not suited 432 to the task of taxonomic classification of greatly divergent RNA sequences. Pertaining to the 433 nonuniformity of the overlap between the results from our classifier and the tools it was 434 compared to, we propose that a combinatorial approach employing multiple tools simultaneously 435 is currently the best option for segregating sequence data by taxonomy (Fig 10) . upgrade and models such as ours that facilitate the annotation of the assembled data will be 453 expected to catch up to these advancements. Contriving features using k-mer abundance profiles 457 We define where S is our set of all DNA sequences q. For all k-mers from k = 1 to k = 6, ∈ 458 and hence E is defined as the set of all possible k-mers of length k. In the equation that ∈ 459 follows, j is the total number of allowed mismatches or gaps in a score we obtain for any ( ) 460 given sequence q and for every element e, as the sum of all instances where is a slice of q 461 from the i-th letter of the word that matches element e, while allowing exactly j total mismatches 462 or gaps between and e. Here, is the length for each sequence q. This final score is used as the pre-scaled value for feature e in sequence q in all constructed 468 machine learning models in this study. Table 6 ). The reverse complement Here a virus, there a virus, everywhere the same virus? 537 Trends Microbiol Viruses in Soil Ecosystems: An Unknown Quantity Within an 539 Annual Review of Virology Emerging view of the human virome Ecological significance of microdiversity: identical 16S 543 rRNA gene sequences can be found in bacteria with highly divergent genomes and 544 ecophysiologies Unraveling the viral tapestry (from 546 inside the capsid out). Isme j Metagenomic contrasts of viruses in soil and aquatic 548 environments Expression of animal virus genomes Towards the system of viruses Genome replication/expression strategies of positive-strand RNA viruses: 553 a simple version of a combinatorial classification and prediction of new strategies Isolation independent methods of characterizing phage 556 communities 2: characterizing a metagenome VirFinder: a novel k-mer based tool for identifying viral sequences from 558 assembled metagenomic data. Microbiome Phage_Finder: automated identification and classification of prophage 560 regions in complete bacterial genome sequences Prophinder: a computational tool for prophage prediction in 562 prokaryotic genomes PHASTER: a better, faster version of the PHAST phage search tool Prophage Hunter: an integrative hunting tool for active prophages ViraMiner: Deep learning on raw DNA sequences for identifying viral 568 genomes in human samples Identifying viruses from metagenomic data using deep learning The evolutionary history of vertebrate RNA viruses Genomic characterisation and epidemiology of 2019 novel coronavirus: 574 implications for virus origins and receptor binding. The Lancet A bioinformatics approach reveals seven nearly RNA-virus genomes in bivalve RNA-seq data Machine Learning for detection of viral sequences in human 578 metagenomic datasets Profile Hidden Markov Models for the Detection of Viruses within 580 Gene selection and classification of 582 microarray data using random forest De novo transcriptome assembly: A comprehensive cross-584 species comparison of short-read RNA-Seq assemblers. GigaScience BLAST+: architecture and applications Origins and challenges of viral dark matter Discovering viral genomes in human metagenomic 590 data by predicting unknown protein families VirSorter: mining viral signal from microbial genomic data Discovering viral genomes in human metagenomic 594 data by predicting unknown protein families 16S Classifier: A Tool for Fast and Accurate Taxonomic 596 Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets Why are RNA virus mutation rates so damn high? Viral metagenomics Third generation sequencing: technology and its potential impact on 602 evolutionary biodiversity research Virus taxonomy: the database of the International Committee on 606 Nucleic Acids Research Accelerated Profile HMM Searches BinPacker: Packing-Based De Novo Transcriptome Assembly from RNA-610 seq Data Bridger: a new framework for de novo transcriptome assembly using 612 RNA-seq data Shannon: An Information-Optimal de Novo RNA-Seq Assembler rnaSPAdes: a de novo transcriptome assembler and 616 its application to RNA-Seq data IDBA-tran: a more robust de novo de Bruijn graph assembler for 621 transcriptomes with uneven expression levels SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq 623 reads De novo assembly and analysis of RNA-seq data Full-length transcriptome assembly from RNA-Seq data without a 627 reference genome Oases: robust de novo RNA-seq assembly across the dynamic range 629 of expression levels S1 Fig. Performance of the 18 Classifiers When Trained on Different Numbers of Features. 633 (a) Logistic Regression Classifier on 3 Feature Sets. (b) Linear SVM Classifier on 3 Feature Sets. (c) RBF-Kernel SVM Classifier on 3 Feature Sets. (d) Decision Tree Classifier on 3 Feature Sets. (e) Random Forest Classifier on 3 Feature Sets. (f) Gaussian Naïve Bayes Classifier 636 on 3 Feature Sets All k-meric features ranked by feature importance instated by our feature 638 selection flow-chart Complete Results for Leave-One-Family-Out Cross-Validation Procedure Information of Sequences Used to Test the Model's Ability to Generalize False Nucleotide Blast Hits Across Whole Human Transcriptome When Queried 642 Against the RNA Virus Sequence Database False Nucleotide Blast Hits Across Whole Mouse Transcriptome When Queried 644 Against the RNA Virus Sequence Database S6 Table. Detailed Information on Training Set Sequences This research received no specific grant from any funding in the public, commercial, or not-for-533 profit sectors. The findings and views expressed are that of the authors alone and none else.