key: cord-0937760-v34b46pu authors: Dasari, Chandra Mohan; Bhukya, Raju title: Explainable deep neural networks for novel viral genome prediction date: 2021-06-25 journal: Appl Intell DOI: 10.1007/s10489-021-02572-3 sha: 5c3c6351068bd9a379ea0c4f008cfe8358456040 doc_id: 937760 cord_uid: v34b46pu

Viral infection causes a wide variety of human diseases, including cancer and COVID-19. Viruses invade host cells and associate with host molecules, potentially disrupting the normal function of the host and leading to fatal diseases. Novel viral genome prediction is crucial for understanding complex viral diseases such as AIDS and Ebola. Most existing computational techniques classify viral genomes, but the efficiency of the classification depends solely on the structural features extracted. State-of-the-art DNN models achieve excellent performance by extracting classification features automatically, but their degree of explainability is relatively poor. The proposed CNN and CNN-LSTM based methods (EdeepVPP and EdeepVPP-hybrid) extract features automatically during model training for viral prediction. EdeepVPP also supports model interpretability: through its learned filters, it extracts the most important patterns that characterize viral genomes. It is an interpretable CNN model that extracts vital, biologically relevant patterns (features) from feature maps of viral sequences. The EdeepVPP-hybrid predictor outperforms all existing methods, achieving 0.992 mean AUC-ROC and 0.990 AUC-PR on 19 human metagenomic contig experiment datasets using 10-fold cross-validation. We evaluate the ability of CNN filters to detect patterns with high average activation values. To further assess the robustness of the EdeepVPP model, we perform leave-one-experiment-out cross-validation. The model can work as a recommendation system to further analyze raw sequences labeled as 'unknown' by alignment-based methods.
We show that our interpretable model can extract, through learned filters, the patterns that are considered to be the most important features for predicting virus sequences.

such as tissue, blood, urine, cells, RNA, DNA, or protein, from plants, animals, or humans. Metagenomic analysis is useful in many fields, such as ecology, bioremediation, and biotechnology. Viral metagenomics deals specifically with the discovery and detection of novel viruses [9, 21, 24, 25, 31, 38, 66, 67, 69]. Current methods for the identification of viral genomes can be broadly classified into alignment-based and machine learning approaches.

In the traditional alignment-based approach, viral sequence detection in human biospecimens is normally performed using BLAST [2]. The sequences are compared against known, publicly available databases and classified based on a similarity index. Metagenomic datasets contain divergent virus sequences that show little or no similarity to sequences in known databases. As a result, many of the virus sequences generated by sequencing technologies are classified as "unknown" by NCBI BLAST [9, 38]. The most popular alignment-based techniques for viral genome classification are REGA [1, 48], USEARCH [19], and SCUEAL [49]. All of these methods depend purely on the alignment score between the viral sequence being classified and the reference dataset. A major drawback of alignment-based approaches is that classification performance depends on the choice of one of several initial alignments and on hyper-parameters. These methods are also expensive, and their performance is unstable for divergent regions of the genome. Another tool for virus sequence detection in metagenomic sequence datasets is HMMER3 [44], which uses profile Hidden Markov Models and compares sequences against the vFams [61] database. vFams is a database of viral protein families built from Multiple Sequence Alignments (MSA) of all RefSeq viral proteins.
HMMER3 detects homologous viral sequences effectively but not highly divergent ones [13], because it depends on the vFams reference database.

In the second type of approach, several machine learning methods have been proposed to classify viral metagenomic sequences [3, 54]. VirSorter [57] is a probabilistic tool that predicts novel viruses in microbial genome data with and without a reference. It was evaluated on metagenomic contig datasets with sequence lengths of 3 kb to 10 kb, and its performance increases with sequence length. VirFinder [52] is a machine learning model that identifies viral contigs based on k-mer frequency. It handles sequence lengths in the range of 1 kb to 5 kb and shows better results than VirSorter on short sequences. ViraPipe [14], an existing recommendation-like system, uses an artificial neural network and a random forest with relative synonymous codon usage frequency to improve the classification of metagenomic data into viral and non-viral sequences. This model identified two codons (CGC and TCG) that were shown to have strong discriminative ability. These methods still face many problems, such as their inability to extract useful hidden information from raw DNA data. The efficiency of machine learning algorithms depends solely on the features that have been extracted.

A machine learning model is interpretable if humans can comprehend it by observing its parameters and how it makes decisions. On the other hand, explainable models are too complex to understand directly and require additional techniques to follow how they work. The decision tree is an interpretable machine learning model, whereas the random forest is explainable [22]. Owing to technical advances, deep learning is now applied for automated feature extraction. It is a well-known technique that has produced excellent results in Natural Language Processing (NLP) [18, 64] and in image and video processing [36].
Deep learning applications in bioinformatics, genomics, and computational biology mostly concentrate on (i) genome sequencing and analysis [15, 32, 51, 72], (ii) classification of DNA [33, 56], chromatin [70], and polyadenylation [26], and (iii) protein structure prediction [20, 62, 68, 72]. The viral genome deep classifier [23], a CNN model, has been proposed for classifying viral sequences into subtypes. Long Short-Term Memory (LSTM) [58] networks have excelled in the field of NLP in recent years, especially when modelling short sentences with hundreds of words [41]. ViraMiner [65] is a model that uses Convolutional Neural Networks (CNN) to detect viral genome sequences in human metagenomic samples. It contains a pattern branch and a frequency branch, each trained separately. The pattern and frequency branches achieved AUC-ROC values of 0.905 and 0.917, respectively, and their combination achieved an AUC-ROC of 0.923. DeepVirFinder [53] is a CNN-based model for identifying viral sequences in metagenomic data. It achieved AUC-ROC values of 0.93, 0.95, 0.97, and 0.98 for viral sequences of length 300, 500, 1000, and 3000 bp, respectively, which indicates that performance improves with sequence length. RNN-VirSeeker [42] is an LSTM-based method for identifying short viral sequences in CAMI and human gut metagenome datasets; with a sequence length of 500 bp, it exhibited a mean AUC-ROC of 0.9175. AUC-ROC is the best metric for evaluating model performance on balanced datasets. Because the human metagenomic datasets are highly imbalanced, AUC-PR is considered the best metric for them.

CNNs are widely criticized for their black-box design, in which the rationale linking input and output cannot be properly observed. Although these existing CNN methods are used to detect viral genomes, none of the above-mentioned methods addresses model interpretability.
Deep learning models are non-transparent, because we cannot decipher any knowledge simply by peeking at neuron weights directly, so there is a demand for model explainability. In the context of deep learning, the terms interpretability and explainability are often used interchangeably [60]. Recently, explainable models have been used for image analysis [60], viral genome prediction [8], and various other tasks [5, 17, 40, 50, 71]. The review in [60] studies the latest implementations of explainable deep learning for various medical imaging tasks and develops a framework for clinical end-users. A two-stage CNN architecture was developed in [5]; it employs gradient-based techniques to generate saliency maps for both the time dimension and the features. The human virus detection method in [8] used DeepLIFT [59] to extract the sub-sequences with the highest contribution scores and also visualized sequence logos based on mean partial Shapley values for each base at each position. The major drawback of this method is that the extracted sub-sequences are not validated.

Contributions. In this paper, we first improve the predictive performance for viral genome sequences on human metagenomic datasets. We demonstrate that the proposed explainable CNN model, EdeepVPP, and the CNN-LSTM based model, EdeepVPP-hybrid, outperform both previous state-of-the-art deep learning models [42, 53, 65] and traditional machine learning models [14, 52, 57]. EdeepVPP predicts viral sequences from novel samples with high accuracy, implying that the model can accurately predict unknown viral sequences, so it works like a recommendation system. Next, we propose a novel approach that makes our models more transparent by extracting the underlying patterns that act as significant features for better classification. As a proof of concept, we validated the patterns extracted by EdeepVPP from human metagenomic datasets against the known patterns of the HOCOMOCO [37] database.
Unlike most existing models, the proposed models are general and not limited to a particular family of viruses.

In this section, we present a classical introduction to CNN and LSTM, the architectures of the proposed models, and the evaluation metrics.

The CNN is a neural network architecture used to detect visual patterns in heterogeneous data. It is a special type of multilayer neural network that maps a fixed-length input to a fixed-size output and is trained with the back-propagation algorithm [39]. It contains several layers, normally one input layer, several hidden layers, and one output layer; each layer contains several neurons, and each neuron has its own parameters [73]. To pass information from one layer to the next, a CNN stores and updates information in its filter weights as it learns the relationship between input and output. Filter weights are first initialized from a random uniform distribution and then updated by back-propagation to minimize a loss (or cost) function. We used the categorical cross-entropy as the loss function, which in its standard form is:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) \qquad (1)$$

where $C$ is the number of classes, $y_i$ is the true (one-hot) label, and $\hat{y}_i$ is the predicted probability for class $i$.

In a CNN, the non-linear activation function applied after convolution plays a crucial role. Sigmoid, Tanh, and ReLU (Rectified Linear Unit) are three commonly used non-linear activation functions. All of these non-linear activation functions perform a squashing operation. The sigmoid function normalizes values to between 0 and 1. The Tanh function normalizes input values to between -1 and 1. ReLU sets negative values to zero and leaves positive values unchanged:

$$\mathrm{ReLU}(z) = \max(0, z) \qquad (2)$$

ReLU is simply a half-wave rectifier and is the most prominent non-linear function. In contrast to the sigmoid and tanh(z) functions, ReLU learns much faster. The output dense layer uses the softmax activation to measure the likelihood of each class in prediction problems. To detect patterns as features, CNNs include multiple filters that slide over the one-hot encoded binary representation of a sequence.
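The encoding and loss computations described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the channel order `ATCGN`, the mapping of unknown symbols to `N`, and all function names are assumptions made for illustration only.

```python
import math

# Assumed channel order; the paper does not state the order it uses.
ALPHABET = "ATCGN"


def one_hot_encode(seq):
    """Encode a DNA string as a list of binary rows, one row per nucleotide."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        # Assumption: any symbol outside the alphabet is treated as N.
        row[index.get(base, index["N"])] = 1
        encoded.append(row)
    return encoded


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Standard categorical cross-entropy for one sample."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))
```

For a two-class problem, `categorical_cross_entropy([0, 1], softmax([0.0, 0.0]))` evaluates to `log(2)`, the loss of an uninformative prediction.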
LSTMs are Recurrent Neural Network (RNN) architectures that are useful for capturing long-term dependencies, because they contain memory units that help to remember important features and forget unimportant ones. An LSTM unit consists of a cell, an input gate, an output gate, and a forget gate. The three gates regulate the flow of information into and out of the cell, and the cell remembers values over arbitrary time periods. The aim of the first LSTM layer (the forget gate layer) is to determine which information from the cell state will be discarded. The next step consists of two parts: the sigmoid layer (the input gate layer) decides which values are to be updated, whereas the tanh layer generates a vector of candidate values to be added to the state. In the update step, the new cell state is computed from the forget and input gate vectors. The output gate layer decides which part of the cell state is to be output. Finally, the hidden state is calculated from the output gate and the current cell state. The LSTM is mainly used to overcome the long-term dependency problem.

The EdeepVPP architecture comprises one input layer, three one-dimensional convolution layers, one flatten layer, two dense layers, and one output layer, as shown in Fig. 1A. We use conv1D, which is powerful for fixed-length sequences and for time series data when deriving visual patterns. Conv1D performs simple array operations, so its computational complexity is significantly lower than that of conv2D. Owing to this low computational complexity, conv1D layers are well suited to real-time applications, especially on hand-held devices such as mobiles and notebooks [35]. It has been observed that the gradients of the cost function approach zero as the depth of a neural network increases. A model is difficult to train if its weights do not change significantly, since the weights then never converge. This problem is called the vanishing gradient problem [45], and it is mitigated by the ReLU non-linear activation function [28].
Each convolutional layer is followed by dropout and max-pooling layers. The flatten layer converts the output of the third convolution layer to a 1D array, which serves as the input to the next fully connected layer.

The convolution operation is a vital step in a CNN. The first convolutional layer convolves the one-hot encoded input with 32 filters, which slide across the input genome. Each filter of size 7 moves with a stride of one position at a time, and the padding is set to 'same' to preserve the original size (300) of the input. These learned filters are used to identify particular patterns in the DNA sequence as features. In each convolution operation, the encoded input genome is convolved with K filters F = {f_1, f_2, ..., f_K}, the biases B = {b_1, b_2, ..., b_K} are added, and each filter generates a separate feature map $M_k^l$ [35], where $f_{ik}^{l-1}$ denotes the learned filter weights at the previous layer $l-1$ and $M_k^l$ is the value after the convolution operation. The non-linear activation transformation σ(.) is applied to the feature maps, and the same process is repeated for all convolution layers. The ReLU activation given in equation (2) follows the convolution layer and applies the max(0, z) operation element-wise to generate the feature maps. A dropout layer with a dropout rate of 0.2 precedes the ReLU activation layers; it randomly drops 20 percent of the connections in each iteration to provide regularization and to minimize over-fitting [63]. Dropout layers are accompanied by max-pooling layers with a pool size of 2 and a stride of 2. Max-pooling takes the maximum over each group of adjacent feature map values, which gives smoother feature activations [4]. Through this down-sampling, max-pooling decreases the spatial dimensions and the cost of computation, and it also extracts characteristic genome features and passes them on to the dense layers. The first convolutional layer, together with its dropout and max-pooling layers, extracts the global characteristics.
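The convolution, ReLU, and max-pooling steps described above can be sketched for a single filter in plain Python. This is an illustrative sketch rather than the Keras implementation used in the paper; the function names are hypothetical, and zero padding of `(width - 1) // 2` positions on the left is assumed to mimic 'same' padding for an odd filter width such as 7.

```python
def conv1d_same(encoded, filt, bias=0.0):
    """Slide one filter over an encoded sequence with stride 1 and 'same' padding.

    encoded: list of rows, one per sequence position, each a list of channel values.
    filt:    list of rows, one per filter position, each a list of channel weights.
    Returns one activation per input position, so the length is preserved.
    """
    length, width = len(encoded), len(filt)
    pad_left = (width - 1) // 2
    feature_map = []
    for pos in range(length):
        total = bias
        for k in range(width):
            src = pos + k - pad_left
            if 0 <= src < length:  # positions outside the input act as zeros
                total += sum(w * x for w, x in zip(filt[k], encoded[src]))
        feature_map.append(total)
    return feature_map


def relu_map(feature_map):
    """Apply max(0, z) element-wise."""
    return [max(0.0, v) for v in feature_map]


def max_pool(feature_map, pool=2, stride=2):
    """Keep the maximum of each window of `pool` adjacent values."""
    return [max(feature_map[i:i + pool])
            for i in range(0, len(feature_map) - pool + 1, stride)]
```

With a 300-position input, `conv1d_same` returns 300 activations and `max_pool` with pool size 2 and stride 2 halves that to 150, which matches the down-sampling role the text assigns to max-pooling.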
The second and third convolutional layers are also accompanied by dropout and max-pooling layers, in the same order as for the first convolution, and extract local characteristics. Table 1 shows the structure of the proposed model in detail. After the full stack of layers, the output of the last pooling layer is transformed into a one-dimensional vector and passed on to the classifier part of the model. This part has two dense layers, with 32 neurons in the first layer and 2 neurons in the next layer. A dropout layer is used between the two dense layers. The last dense layer has a softmax activation function, which produces two probabilities, one for the positive (true) and one for the negative (false) target class. In its standard form, the softmax function is:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

Finally, the genome sequence is classified as viral or non-viral based on the output probability. The categorical cross-entropy loss function is given in equation (1). After every epoch, the filter weights are updated to minimize the loss function. We implemented the network with Keras [27], a minimalistic, highly modular neural network library written in Python.

The EdeepVPP-hybrid model consists of one embedding layer, four one-dimensional (1D) convolution layers, three LSTM layers, two dense layers, and one output layer, as shown in Fig. 2. The DNA sequences consist of 5 nucleotide symbols (A, T, C, G, N), and each nucleotide is represented as an integer.

The proposed CNN model was assessed using two popular classification performance metrics, AUC-ROC and AUC-PR. To calculate these metrics, Precision (Prec), Specificity (Sp), Recall or Sensitivity (Sn), True Positive Rate (TPR), False Positive Rate (FPR), and Accuracy (Acc) are required.

$$\mathrm{Prec} = \frac{TP}{TP + FP}, \qquad \mathrm{TPR}\ (\text{or}\ \mathrm{Sn}) = \frac{TP}{TP + FN}$$

Here TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative predictions, respectively.
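As a hedged illustration of the quantities listed above, the following sketch computes them from confusion counts using their standard textbook definitions; only the precision and recall formulas are taken from the text, and the function names are hypothetical. The `auc_roc` helper uses the pairwise-ranking definition of the area under the ROC curve, which is practical only for small examples.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion counts (assumes non-zero denominators)."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # also called sensitivity or TPR
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (viral, non-viral) pairs ranked correctly.

    labels: 1 for viral, 0 for non-viral; scores: predicted viral probability.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that precision and recall never use the true negative count, whereas specificity, FPR, and accuracy do; this is why the first two are less affected by a large non-viral majority.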
Accuracy, sensitivity, and specificity are sensitive to the class distribution of the dataset, because there are far fewer viral sequences than non-viral ones. The datasets used in the proposed work are extremely imbalanced, consisting of variable numbers of viral and non-viral sequences. AUC-ROC is better suited to cases where viral and non-viral samples occur in similar proportions; otherwise, the majority samples have a greater impact on the curve than the minority samples, which can introduce bias. On the other hand, the precision-recall curve is widely used for class-imbalanced problems, since it does not take true negatives into account, so there is no risk of the majority samples dominating, and it therefore provides an adequate assessment.

In this section, we first give an overview of the metagenomic datasets used in the experiments and of the data preprocessing. Next, we discuss cross-validation and the settings used for training the EdeepVPP model.

We collected 19 different metagenomic contig experiments, called human metagenomic datasets, derived from human samples, to train the EdeepVPP prediction model and to validate the model output. These datasets belong to various patient groups and were generated by next-generation sequencing; they were analyzed and labeled by PCJ-BLAST [47]. We collected 300 bp contigs for our experiments, which are described in detail in [65]. The details of the human metagenomic datasets are given in Table 2.

Cross-validation is a model quality evaluation method that is better than the residual evaluation approach and is useful for avoiding overfitting and underfitting. K-fold cross-validation randomly splits the samples of the dataset into k folds (groups) of approximately equal size. Iteratively, k-1 folds at a time are used as the training set, and the model is tested on the remaining fold.
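The k-fold procedure can be sketched as follows: in each round, k-1 folds are used for training and the held-out fold for testing. This is a minimal standard-library sketch, not the authors' code; the function name, the fixed seed, and the round-robin assignment of shuffled indices to folds are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging a metric over the k rounds uses every sequence once for evaluation.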
We adopt a standard strategy to select

Different hyper-parameters are tuned by training the model, and the best parameter values are selected on the basis of the lowest validation loss. We performed a random search to select optimal values for the hyper-parameters. The tuned hyper-parameters include the CNN filter sizes, the learning rate, the dropout ratio, and the activation function. The search space and the selected values for these hyper-parameters are shown in Table 3. In each fold, the neural network was trained for only 6 epochs.

In this section, the capability of the proposed models is systematically described and compared with state-of-the-art methods. We tested our system on human metagenomic datasets in order to demonstrate its ability. We observed that the viral and non-viral samples in the human metagenomic dataset are greatly imbalanced. We shuffled the dataset sequences and created balanced and imbalanced sets. In the balanced datasets, the viral and non-viral sequences are kept in a 1:1 ratio; because the non-viral sequences are far more numerous, they are selected randomly from the available sequences. First, we trained the model on each human metagenomic experiment individually, using 10-fold cross-validation with only 6 epochs. Second, 10-fold cross-validation was used on the human metagenomic dataset (Fig. 3). Finally, we merged all five serum datasets into what we call the human serum dataset and performed 5-fold cross-validation. The proposed model achieved 0.991 AUC-ROC on the human serum viral dataset. The EdeepVPP model achieves AUC-PR values of 0.9881 and 0.9872 on the human metagenomic and human serum datasets, respectively, as shown in Fig. 4. For the human viral metagenomic datasets, the mean performance is reported in terms of AUC-ROC and AUC-PR, and the results are shown in Table 4.
From the obtained results, the AUC-ROC values for the balanced and imbalanced datasets are 92.41% and 98.81%, respectively, and the corresponding AUC-PR values are 92.13% and 98.72%. In particular, the imbalanced datasets achieved high accuracy, which improves the discriminatory ability of the EdeepVPP model. For individual datasets the average AUC-ROC ranges

To further validate the discriminative ability of EdeepVPP, we also perform Leave-One-Experiment-Out cross-validation (LOEOCV) on the human serum dataset. The human serum dataset contains 5 metagenomic experiments derived from serum samples. We trained a separate EdeepVPP model on the data from four metagenomic experiments and used the remaining experiment to test the model. Good performance of EdeepVPP under LOEOCV implies that the model can predict novel, unknown viral sequences. We conduct this leave-one-experiment-out evaluation for each metagenomic experiment. We compared the LOEOCV results of EdeepVPP with those of the existing state-of-the-art approach ViraMiner [65], as shown in Table 5. The leave-one-out results of EdeepVPP were better than the standard results of ViraMiner. These results indicate that the proposed model can better predict novel sequences that are not included in the training set. Our model was trained with divergent metagenomic contig sequences derived from different sample types, such as skin, prostate secretion, serum, and cervix tissues.

Interpretation can be described as the degree of comprehension of what a model does. In recent years, CNN models with complex internal implementations have provided high precision for biological sequence classification. A lot of effort has gone into improving the prediction efficiency of these models. Acceptance of a CNN model, however, depends not just on its efficiency but also on how well the user can understand the underlying mechanism.
Although CNN models produce excellent predictive results, they are regarded as "black boxes" because of their complex structure. For problems involving the classification of biological sequences, there is a great need for interpretability to ensure the accuracy of the decisions taken by CNN models. There is a need to eliminate this 'black box' nature and to provide transparent models that expose the underlying feature extraction process. Discrimination and interpretability are two major advantages of the proposed model. EdeepVPP, an interpretable CNN system, is capable of extracting features for predicting viral genomes, as shown in Fig. 1B. After training is complete, the filters become learned filters, meaning that their weights have been updated to optimum values. We developed a step-by-step computational approach to identify the underlying patterns that guide the prediction of viral sequences, as shown in Fig. 6. For viral genome prediction, we extracted from the human metagenomic dataset the motifs whose activation values are higher than a threshold (half of the highest activation value). The activation value of a pattern depends on the response of the different filters over the input sequences. Filters can capture highly important structures. From each filter, we considered the one motif that has the highest activation value. Table 6 gives one pattern from each filter together with the corresponding activation value; these are highly influential patterns for the detection of viral sequences in the human dataset. The motifs ACGACCG and ACGCAGT, extracted by filters 11 and 16, are biologically relevant motifs for the identification of viral sequences. We performed the interpretability analysis on the human serum and human metagenomic datasets.
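The thresholding step described above can be sketched as follows for a single filter and a single sequence. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the feature map has the same length as the sequence (as with 'same' padding), it cuts a window of 7 nucleotides centred on each selected position to match the first-layer filter length, and the function names are hypothetical.

```python
def extract_motifs(sequence, feature_map, flank=3):
    """Return (motif, activation) pairs above half of the maximum activation.

    sequence:    DNA string.
    feature_map: one activation per sequence position for one filter.
    flank:       nucleotides kept on each side of the selected position.
    """
    threshold = max(feature_map) / 2.0
    motifs = []
    for pos, activation in enumerate(feature_map):
        start, end = pos - flank, pos + flank + 1
        # Skip positions too close to the ends to give a full-length window.
        if activation > threshold and start >= 0 and end <= len(sequence):
            motifs.append((sequence[start:end], activation))
    return motifs


def top_motif(motifs):
    """Pick the motif with the highest activation for one filter."""
    return max(motifs, key=lambda item: item[1])
```

For example, a feature map that peaks at the centre of `ACGACCG` inside a longer read would yield that 7-mer as the top motif for the filter, which is the kind of entry the paper reports per filter.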
We sorted the patterns by frequency and extracted the top-five viral and non-viral patterns in each filter, and also the top-50 viral and non-viral patterns over all 32 filters, from the human serum and human metagenomic datasets. The top-five patterns in each filter match the patterns of the other filters (Table 7).

The pattern extraction procedure consists of the following steps:

1. Extract the learned filter weights from the first convolution layer of the proposed model after completion of all epochs.
2. Pass the test sequences through the EdeepVPP model and apply the extracted filters to generate feature maps.
3. Calculate the mean or half of the highest activation value from the feature activations and set it as the threshold for extracting the important features.
4. Spot all positions in the feature map where the activations are greater than the threshold value.
5. For each filter, backtrack the spotted positions to the input sequences and extract the important patterns, taking 3 nucleotides upstream and 3 nucleotides downstream of the spotted position, including the spotted position itself.
6. Store the extracted patterns along with the filter number, the mean, and the feature map value for all the test sequences.
7. Finally, calculate the frequency of all patterns at the level of each filter and at the overall feature map level.

The top-50 patterns from all filters are extracted separately for the viral and non-viral sequences of the human metagenomic dataset (Table 8). We found that some of the patterns are common to both viral and non-viral sequences; these are considered neutral, as their role in classifying viral sequences is negligible. The viral patterns (excluding the neutral ones) are grouped, and a position weight matrix (PWM) is determined by measuring nucleotide frequency. The PWMs are used to generate sequence logos using WebLogo3 [16]. The viral patterns are extracted from the human serum dataset in the same way. The logos for the human metagenomic and human serum datasets are shown in Fig. 7(a) and (b), which represent the motifs TTTTAAT and TAAATAT, respectively. In Fig.
7(a) and (b), the size of a base in the pattern indicates the frequency (probability) of the corresponding nucleotide at that position. TOMTOM [29], the motif comparison tool of the MEME Suite [6], compares one or more motifs against annotated motifs from existing databases (e.g. JASPAR and the human and mouse collection HOCOMOCO). To evaluate EdeepVPP's capability, we compared the patterns extracted by our model from the human metagenomic and human serum datasets with the known patterns of HOCOMOCO [37]. We note that the learned patterns of EdeepVPP matched a large number of significant known patterns. The motifs learned from the human metagenomic and human serum datasets matched 44 and 61 existing motifs, respectively, indicating that the interpretable strategy suggested in this paper establishes transparency and reveals hidden features for better classification. Figure 7(a.1, a.2) shows the matched motifs for the human metagenomic dataset, and Fig. 7(b.1, b.2) shows the matched motifs for the human serum dataset, in the HOCOMOCO database.

Filters are learned weight vectors that play a crucial role in the identification of patterns (motifs) for classification problems. In addition to pattern extraction, we also evaluate the robustness of the first-layer convolution filters of EdeepVPP. Visualizations of the first filters for the human metagenomic and human serum datasets, produced with displayr, are shown in Fig. 8. The X-axis of the heatmaps indicates the location

To visualize a filter of size L×5 (where L is the filter length, i.e. 7 in the first convolution layer) containing learned weights, a heatmap is used to show the significance of the bases at each position. If the extracted patterns have higher mean activation values for the filters, then the patterns have a greater impact on classifying a sequence as truly viral. Darker colours reflect a greater contribution of the nucleotide at that specific position.

The prediction of the viral genome plays a vital role in the study of complex diseases.
We introduced two deep learning models. The first is EdeepVPP, an interpretable CNN model for pattern (motif) extraction, which predicts true and pseudo viral sequences. This model consists of a stack of convolution and pooling layers that takes DNA sequences as input and generates probabilities for true and false viral sequence classification as output. EdeepVPP performs two tasks: novel viral prediction and interpretability. The second model, EdeepVPP-hybrid, consists of CNN and LSTM layers to identify viral genomes effectively. To evaluate the proposed models, we used 10-fold, 5-fold, and leave-one-experiment-out cross-validation. The performance metrics AUC-ROC and AUC-PR are used for evaluation and for comparison with state-of-the-art techniques. These models outperformed all existing viral sequence classification methods on the human metagenomic and human serum datasets. Model interpretability involves three tasks: the detection of the most important patterns that lead to the identification of true and false viral sequences, the ability of the learned filters to detect these patterns, and the validation of these patterns against the known patterns of the HOCOMOCO database. The motifs learned from both the human metagenomic and human serum datasets matched a large number of existing motifs, indicating that the interpretable strategy proposed in this work establishes transparency and reveals the hidden features for better classification. In addition, the EdeepVPP models can be extended to predict various viral diseases such as COVID-19. We believe that the proposed CNN model EdeepVPP is capable of extracting essential features, recognizing possible viral sequences, and discovering viral-associated sequence patterns.
A standardized framework for accurate, high-throughput genotyping of recombinant and non-recombinant viral sequences Basic local alignment search tool Marvel, a tool for prediction of bacteriophage sequences in metagenomic bins Deep learning for computational biology Explainable deep neural networks for multivariate time series predictions Meme suite: tools for motif discovery and searching Interpretable detection of novel human viruses from genome sequencing data Interpretable detection of novel human viruses from genome sequencing data NAR Phylogenetically diverse tt virus viremia among pregnant women Viremia during pregnancy and risk of childhood leukemia and lymphomas in the offspring: Nested case-control study Unbiased approach for virus detection in skin lesions Deep sequencing extends the diversity of human papillomaviruses in human skin Extension of the viral ecology in humans using viral profile hidden markov models Machine learning for detection of viral sequences in human metagenomic datasets Gene expression inference with deep learning Weblogo: a sequence logo generator Explainable artificial intelligence (xai) approaches and deep meta-learning models Deep dynamic models for learning hidden representations of speech features. 
In: Speech and audio processing for coding, enhancement and recognition
Search and clustering orders of magnitude faster than BLAST
DNdisorder: predicting protein disorder using boosting and deep networks
High throughput sequencing reveals diversity of human papillomaviruses in cutaneous lesions
Explainable and interpretable models in computer vision and machine learning
Clonal integration of a polyomavirus in human Merkel cell carcinoma
Human skin microbiota: high diversity of DNA viruses identified on the human skin by high throughput sequencing
DeepPolyA: a convolutional neural network approach for polyadenylation site prediction
Deep sparse rectifier neural networks
Quantifying similarity between motifs. Genome Biology
Computational prospecting the great viral unknown
Metagenomic sequencing of "HPV-negative" condylomas detects novel putative HPV types
Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks
Taxonomic classification for living organisms using convolutional neural networks
Ecology of viruses in soils: past, present and future perspectives
1D convolutional neural networks and applications: a survey
ImageNet classification with deep convolutional neural networks
HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-seq analysis
Previously unknown and highly divergent ssDNA viruses populate the oceans
Deep learning
Training interpretable convolutional neural networks by differentiating class-specific filters
A critical review of recurrent neural networks for sequence learning
RNN-VirSeeker: a deep learning method for identification of short viral sequences from metagenomes
Next-generation sequencing of cervical DNA detects human papillomavirus types not detected by commercial kits
Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions
Why are deep neural networks hard to train
Neural networks and deep learning
Disease-specific alterations in the enteric virome in inflammatory bowel disease
Massively parallel implementation of sequence alignment with Basic Local Alignment Search Tool using parallel computing in Java library
Automated subtyping of HIV-1 genetic sequences for clinical and surveillance purposes: performance evaluation of the new REGA version 3 and seven other tools
An evolutionary model-based algorithm for accurate phylogenetic breakpoint mapping and subtype prediction in HIV-1
Explainability methods for graph convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences
VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data
Identifying viruses from metagenomic data using deep learning
Identifying viruses from metagenomic data by deep learning
Gut DNA viromes of Malawian twins discordant for severe acute malnutrition
A deep learning approach to DNA sequence classification
VirSorter: mining viral signal from microbial genomic data
Long short-term memory
Learning important features through propagating activation differences
Explainable deep learning models in medical image analysis
Profile hidden Markov models for the detection of viruses within metagenomic sequence data
A deep learning network approach to ab initio protein secondary structure prediction
Dropout: a simple way to prevent neural networks from overfitting
Sequence to sequence learning with neural networks
ViraMiner: deep learning on raw DNA sequences for identifying viral genomes in human samples
Metagenomics - a guide from sampling to data analysis
Newly discovered Ebola virus associated with hemorrhagic fever outbreak in Uganda
DeepCNF-D: predicting protein order/disorder regions by weighted deep convolutional neural fields
Case studies of the spatial heterogeneity of DNA viruses in the cystic fibrosis lung
An image representation based convolutional network for DNA classification
Interpretable convolutional neural networks
A deep learning framework for modeling structural features of RNA-binding protein targets
SpliceRover: interpretable convolutional neural networks for improved splice site prediction