key: cord-0617755-4pznmkq9
authors: Aakur, Sathyanarayanan N.; Narayanan, Sai; Indla, Vineela; Bagavathi, Arunkumar; Ramnath, Vishalini Laguduva; Ramachandran, Akhilesh
title: MG-NET: Leveraging Pseudo-Imaging for Multi-Modal Metagenome Analysis
date: 2021-07-21
journal: nan
DOI: nan
sha: 24be317c55581c4d6cebed0f564569e37b72f072
doc_id: 617755
cord_uid: 4pznmkq9

The emergence of novel pathogens and zoonotic diseases like SARS-CoV-2 has underlined the need for developing novel diagnosis and intervention pipelines that can learn rapidly from small amounts of labeled data. Combined with technological advances in next-generation sequencing, metagenome-based diagnostic tools hold much promise to revolutionize rapid point-of-care diagnosis. However, there are significant challenges in developing such an approach, chief among which is learning self-supervised representations that can help detect novel pathogen signatures with very small amounts of labeled data. This is a particularly difficult task given that closely related pathogens can share more than 90% of their genome structure. In this work, we address these challenges by proposing MG-Net, a self-supervised representation learning framework that leverages multi-modal context using pseudo-imaging data derived from clinical metagenome sequences. We show that the proposed framework can learn robust representations from unlabeled data that can be used for downstream tasks such as metagenome sequence classification with limited access to labeled data. Extensive experiments show that the learned features outperform current baseline metagenome representations, given only 1000 samples per class.

Advances in DNA sequencing technologies [21, 22] have made it possible to determine whole-genome sequences of simple unicellular (e.g., bacteria) and complex multicellular (e.g., human) organisms more cheaply, more quickly, and at a larger scale.
The abundance of collected genome sequences requires reliable and scalable frameworks to detect novel pathogens and to study further mutations of such pathogens in order to mitigate threatening disease transmission. Zoonotic diseases, like SARS-CoV-2, are a prime example of the need for rapid learning from noisy and limited data, due to their ability to mutate and cause pandemic situations. To this end, DNA sequencing-based approaches, such as metagenomics, have been explored by several researchers [7, 11] for plant and animal disease diagnostics. Metagenome-based diagnostics are pathogen agnostic and theoretically have unlimited multiplexing capability. Unlike traditional methods, metagenome-based diagnostics can also provide information on the host's genetic makeup that can aid in personalized medicine [10]. However, metagenome diagnostics encounter the problem of a long-tail distribution of pathogen sequences in the data. The problem is aggravated for pathogen detection tasks when we consider pathogens from the same genus. For example, Mannheimia haemolytica and Pasteurella multocida share as much as 95.5% of their genome. In this work, we consider the Bovine Respiratory Disease Complex (BRDC) as a model and aim to detect the presence of six associated bacterial pathogens, namely Mannheimia haemolytica, Pasteurella multocida, Bibersteinia trehalosi, Histophilus somni, Mycoplasma bovis, and Trueperella pyogenes.

One of the major challenges in metagenome-based diagnostics is the need for specialized bioinformatics pipelines to analyze the enormous amounts of DNA sequences for disease markers [7, 11]. Machine learning research offers several opportunities to analyze DNA sequences collected directly from environment samples [6].
Deep learning models, in particular, have been explored for representation learning from metagenome sequences for many associated tasks including, but not limited to: capturing simple nucleotide representations with reverse-complement CNNs and LSTMs [4]; predicting the taxonomy of metagenome sequences with depth-wise separable convolutions [5]; genomic sub-compartment prediction [1] and disease gene prediction [13] with graph representations; predicting the taxonomy of sequences by learning representations with bidirectional LSTMs with k-mer embedding and a self-attention mechanism [18]; and learning metagenome representations with ResNet [12] to predict the taxonomy.

Pseudo-imaging is often used in astrophysics [24] and medicine [27] to study objects or tissues by forming images in another modality using alternative sensing methods, which yield more detailed representations than conventional imaging. In particular, pseudo-imaging is widely used in medicine to obtain pseudo-CT estimations from MRI images [17] and ultrasound deformation fields [32]. Deep learning models are particularly suited to handling pseudo-image data, due to the success of convolutional neural networks in computer vision research [12, 30]. However, very few pseudo-imaging methods have been used for metagenome analysis. For example, Self-Organizing Maps (SOM) [26] and Growing Self-Organizing Maps (GSOM) [25] have been used to represent metagenome sequences as images, with a shallow CNN model used for disease prediction. A matrix representation of a phylogenetic tree has been used with a CNN to predict the host phenotype of metagenome sequences [28].

In this section, we introduce our MG-NET framework for extracting robust, self-supervised representations from metagenome data.
Our approach has three major components: (i) capturing a global structural prior for each metagenome sequence conditioned on the metagenome structure, (ii) extracting local structural features from pseudo-images generated from metagenome sequences, and (iii) integrating local and global structural features in an attention-driven structural reasoning module for multi-modal feature extraction. The overall approach is illustrated in Figure 1. We jointly model the global and local structural properties in a unified framework, which is trained in a self-supervised manner, without labels, to capture robust representations aimed at metagenome classification with limited and unbalanced labeled data.

First, we construct a global graph representation of the entire metagenome sample, i.e., the graph provides a structural representation of the sequenced clinical sample. We take inspiration from the success of De Bruijn graphs for genome analysis [20, 23] and use a modified version to represent the metagenome sample. Given a metagenome sample $X$ with sequence reads $X_0, X_1, \ldots, X_n \in X$, we construct a weighted, directed graph whose nodes are populated by k-mers $x_j$ such that $x_0, x_1, \ldots, x_l \in X_i$. Each k-mer is a subsequence of length k from a genome read $X_i$, extracted using a sliding window of length k and stride s. Each edge direction is determined by the order of occurrence of each observed k-mer in the sliding window. The edge weights are iteratively updated based on the observed co-occurrence of the nodes and are a function of the frequency of co-occurrence of the k-mers. The updated weights are given by Equation 1, where $e_{i,j}$ is the current weight between the k-mer nodes $x_i$ and $x_j$ and $\hat{e}_{i,j}$ is the new weight to be updated; $f_s(\cdot)$ is a weighted update function that bounds the new weight within a given range.
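The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `extract_kmers` and `build_kmer_graph` are hypothetical, "co-occurrence" is taken to mean that one k-mer directly follows another in the sliding window, and raw co-occurrence counts stand in for the bounded update function $f_s$.

```python
from collections import defaultdict


def extract_kmers(read, k=5, stride=1):
    """Slide a window of length k over the read, advancing by `stride`."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, stride)]


def build_kmer_graph(reads, k=5, stride=1):
    """Build a weighted, directed k-mer graph for a whole sample.

    An edge (a, b) means k-mer b was observed directly after k-mer a.
    Assumption: the weight is the raw co-occurrence count; the bounded
    update f_s from the text is NOT applied in this sketch.
    """
    weights = defaultdict(int)
    for read in reads:
        kmers = extract_kmers(read, k, stride)
        for a, b in zip(kmers, kmers[1:]):
            weights[(a, b)] += 1
    return dict(weights)


# Two toy reads standing in for a metagenome sample.
graph = build_kmer_graph(["ACGTAC", "CGTACG"], k=3, stride=1)
for (src, dst), w in sorted(graph.items()):
    print(f"{src} -> {dst}: {w}")
```

Edges that recur across reads accumulate larger weights, which is the property the bounded update is meant to emphasize while suppressing spurious links.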
In our experiments, we bound the edge weights to be between -2 and 2 and hence set $f_s(q) = 2\max(q - 1, 1) + (\min(q - 2, 2) + 2)$ to capture the relative increase in frequency, highlighting structures that emerge through repeated co-occurrence while suppressing spurious links. The edge weights are initially set to 1. Given this structural representation, we extract features for each k-mer using node2vec [9] to capture the "community" or neighborhood structure of a k-mer and to reject clutter due to observation noise [16]. The resulting representation $x^{st}_i$ for each k-mer $x_i$ captures its neighborhood within a sequence and provides a structural prior over the metagenome structure. The global structural representation for a sequence $X_i$ is the average-pooled (AP) representation of its k-mers, given by $X^{g}_i = AP(\{x^{st}_0, x^{st}_1, \ldots, x^{st}_n\})$.

The second step in our framework is to generate a pseudo-image $I_r$ for each metagenome sequence read $X_r \in X$. The key intuition behind generating a pseudo-image is to automatically represent and learn recurring patterns (or "fingerprints" [31]) in metagenome sequence reads that belong to the same pathogen species. For example, in Figure 2, it can be seen that two sequences from the same pathogen, Mannheimia haemolytica, have recurring patterns across clinical samples. Inspired by the success of the Gray Level Co-occurrence Matrix (GLCM) [2, 3], we use a histogram-based formulation to provide a visual representation of a metagenome sequence. Instead of binary co-occurrence, we use relative co-occurrence to generate images with varying intensity. Each "pixel" $p_{i,j} \in I_r$ is representative of the frequency of co-occurrence between the k-mers $x_i$ and $x_j$ in a sequence read.
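To make the pixel definition concrete, the following sketch builds a single-channel co-occurrence image for one read. It is an illustration under assumptions rather than the paper's exact formulation: the helper name `pseudo_image` is hypothetical, co-occurrence is taken to mean consecutive k-mers in the sliding window, and intensities are scaled by the total number of co-occurrences so that they lie between 0 and 1.

```python
from itertools import product


def pseudo_image(read, k=2, stride=1, alphabet="ACGT"):
    """Return (image, index) for one sequence read.

    image[i][j] is the relative frequency with which k-mer j directly
    follows k-mer i in the read (an assumed notion of co-occurrence).
    index maps each k-mer string to its row/column position.
    """
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    size = len(vocab)  # 4**k rows and columns
    image = [[0.0] * size for _ in range(size)]

    kmers = [read[i:i + k] for i in range(0, len(read) - k + 1, stride)]
    total = 0
    for a, b in zip(kmers, kmers[1:]):
        if a in index and b in index:  # skip k-mers with ambiguous bases such as N
            image[index[a]][index[b]] += 1.0
            total += 1

    if total:  # scale intensities so that all pixels sum to 1
        image = [[v / total for v in row] for row in image]
    return image, index


img, idx = pseudo_image("ACGTACGT", k=2)
print(len(img), "x", len(img[0]))   # image side length is 4**k
print(img[idx["AC"]][idx["CG"]])    # relative frequency of AC followed by CG
```

With k = 5, the image would have 4^5 = 1024 rows and columns, so the toy example uses k = 2 only to keep the matrix small.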
The resulting pseudo-image representation is given by Equation 2, where $f(x_i, x_j)$ is the relative co-occurrence between k-mers $x_i$ and $x_j$ computed using Equation 1; $s$ is the stride length; $N$ is the sum of all co-occurrences, used to scale the value between 0 and 1; and $\lambda_{min}$ is a cutoff parameter that reduces the impact of noise introduced by read errors [16]. Note that $e_{i,j}$ is computed per sequence read and not at the sample level as done in Section 2.1. This allows us to model sequence-level patterns and hence capture species-specific patterns. The depth of the image is set to 3, and the pixel values are duplicated to simulate an RGB image, which allows us to leverage advances in deep neural networks to extract features automatically. In our experiments, we set $\lambda_{min} = 0$.

The third and final step in our framework is to use the global representations (Section 2.1) to help learn local structural properties from pseudo-images (Section 2.2) using attention as a structural reasoning mechanism. Specifically, we use a convolutional neural network (CNN) to extract local structural features from a given pseudo-image. As can be seen from Figure 1, we use the intermediate (4th convolutional block) layer of the CNN as a local feature representation $X^{lc}_i$ of a sequence read $X_i$. We obtain a robust representation by using an attention-based reasoning mechanism given by Equation 3, where GAP refers to the Global Average Pooling function [19] and $X^{g}_i$ is the global structural feature representation provided by the global graph representation from Section 2.1. We train the network end-to-end by having a decoder block (a mirrored network of deconvolutional operations) reconstruct the input pseudo-image from this structural representation. Note that our goal is to learn robust representations from limited labeled metagenome data rather than reconstruction or segmentation.
Hence, we avoid skip connections like those in U-Net [29], which would allow the network to "cheat" and not learn robust, "compressed" representations. We augment the features from the CNN with structural features from the node2vec representations using Equation 3, where we flatten the feature maps so that they match the dimensions for element-wise multiplication in the attention mechanism. We use the L2 norm between the reconstructed and the actual pseudo-image as the objective function to train the network in a self-supervised manner. Formally, we define the loss function as $L_{recons} = \lVert \hat{I}_i - I_i \rVert_2$, where $\hat{I}_i$ and $I_i$ refer to the reconstructed and actual pseudo-images, respectively. We show empirically (Section 3) that the integrated reasoning during training enhances the performance as opposed to mere concatenation of auto-encoder and global features.

We use a 4-layer convolutional neural network based on VGG-16 [30] as our feature extractor in Section 2.2. We mirror the network to have a 4-layer decoder network to reconstruct the pseudo-image. The network is trained end-to-end for 25 epochs with a batch size of 64 and converges in about 30 minutes on a server with an NVIDIA Titan RTX and a 32-core AMD ThreadRipper CPU. The extracted representations are then fine-tuned for 10 epochs with a 3-layer deep neural network for pathogen classification. The learning rate is set to $1 \times 10^{-4}$ for both stages, and the networks are optimized using the standard gradient descent optimizer. Empirically, we find that a k-mer length of 5 and a stride of 10 provide the best results, and we present other variations in the ablation study (Section 3) for completeness. Default parameters were used for node2vec representations. All networks are trained from scratch during the pre-training phase.

Data Collection.
To construct the dataset for the training and evaluation of automated metagenome-based pathogen detection, we collected metagenome sequences from 13 Bovine Respiratory Disease Complex (BRDC) lung specimens at a local (name redacted to preserve anonymity) diagnostic laboratory using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany). Sequencing libraries are prepared from the extracted DNA using the Ligation Sequencing Kit and the Rapid Barcoding Kit. Prepared libraries are sequenced using MinION (R9.4 flow cells), and sequences with an average Q-score of more than 7 are used in the final genome. The R script MinIONQC [15] was used for quality assessment.

Annotation and Quality Control. We used the MiFi platform [8] for labeling metagenome sequence data. This platform is based on a modified version of the bioinformatics pipeline discussed by Stobbe et al. [31]. Using MiFi, unique signature sequences referred to as e-probes were developed for the pathogen of interest. These e-probes were then used to identify and label pathogen-specific sequences in the metagenome reads and to differentiate them from other sequences (host, commensals, and other pathogen sequences). Clinical metagenome samples from 7 patients were used for training and 1 for validation, while sequences from 5 patients were used for evaluation.

Metrics and Baselines. To quantitatively evaluate our approach, we use precision, recall, and the F-score for each class. We do not use accuracy as a metric since real-life metagenomes can be highly skewed towards host sequences. Precision and recall, on the other hand, allow us to quantify the false alarm rates and provide a more precise measure of detection performance. We compare against other representation learning frameworks for metagenome analysis, such as graph-based approaches [23] and Seq2Vec, an end-to-end sequence-based deep learning model based on DeePAC [4].
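The per-class metrics named above can be computed as in the following sketch, which also shows why accuracy is misleading on host-skewed data. It assumes that "F-score" means the F1 score (the harmonic mean of precision and recall); the function name `per_class_prf` and the toy labels are illustrative only and are not taken from the paper's data.

```python
def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy, host-skewed example: 8 host reads and 2 pathogen reads.
y_true = ["host"] * 8 + ["pathogen"] * 2
y_pred = ["host"] * 9 + ["pathogen"]  # one pathogen read is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
for label, (p, r, f) in per_class_prf(y_true, y_pred).items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
```

In this toy case the accuracy is 0.9 even though half of the pathogen reads are missed, which the pathogen recall of 0.5 makes visible.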
For classification, we consider both traditional baselines such as logistic regression (LR), support vector machines (SVM), and a multi-layer perceptron (MLP), as well as a deep neural network (DL). The MLP baseline has two hidden layers with 256 neurons each, while the deep learning baseline has 3 hidden layers with 256, 512, and 1024 neurons, respectively, each with a ReLU activation function. We choose the hyperparameters for each of the baselines using an automated grid search, and the best-performing models on the validation set were taken for evaluation on the test set.

We evaluate our approach and report the quantitative results in Table 1 and Table 2. We evaluate under different settings to assess the robustness of the proposed framework under limited data and limited supervision. We also compare against comparable representation learning approaches to highlight the importance of attention-based reasoning for integrating global and local structural information in a unified framework. We significantly outperform all baselines when training with the entire training data and offer competitive performance when fine-tuned with only 500 labeled samples per class.

Effect of Limited Labels. Since our representations are learned in a self-supervised manner, we also evaluate their metagenome recognition performance with limited labeled data and summarize the results in Table 1. First, we evaluate the case where there is no labeled data, using k-means clustering to segment the features into groups and aligning the predicted clusters with the ground truth using the Hungarian method, following prior work [14]. It can be seen that we perform reasonably well considering that we do not use any ground-truth labels to train. As expected, the performance improves as the amount of labeled data increases.
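The cluster-to-label alignment used in the label-free evaluation can be illustrated as follows. To stay dependency-free, this sketch replaces the Hungarian method with an exhaustive search over one-to-one assignments; both find an assignment with maximum overlap, but the exhaustive search is only practical for a handful of classes, and in practice a Hungarian-method solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. The function name `align_clusters` and the toy data are hypothetical.

```python
from itertools import permutations


def align_clusters(y_true, cluster_ids):
    """Map each cluster id to a distinct class label so that the number
    of samples whose mapped cluster equals their true label is maximal.

    Assumption: there are no more clusters than classes. Exhaustive
    search is used here as a stand-in for the Hungarian method.
    """
    classes = sorted(set(y_true))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")

    # overlap[(cluster, label)] = number of samples with that pair
    overlap = {}
    for label, cluster in zip(y_true, cluster_ids):
        overlap[(cluster, label)] = overlap.get((cluster, label), 0) + 1

    best_map, best_hits = {}, -1
    for perm in permutations(classes, len(clusters)):
        hits = sum(overlap.get((c, l), 0) for c, l in zip(clusters, perm))
        if hits > best_hits:
            best_hits, best_map = hits, dict(zip(clusters, perm))
    return best_map, best_hits / len(y_true)


labels = ["a", "a", "b", "b", "c", "c"]
clusters = [2, 2, 0, 0, 1, 0]  # e.g. k-means output with arbitrary ids
mapping, acc = align_clusters(labels, clusters)
print(mapping, acc)
```

Once the mapping is fixed, the mapped cluster ids can be scored with the same per-class precision, recall, and F-score used for the supervised settings.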
It is interesting to note that we match the performance of fully supervised models like sequence-level graph-based representations [23] and end-to-end deep learning models like Seq2Vec with as few as 500 labeled samples per class and outperform them with as few as 1000 labeled samples per class. Given the performance of the linear classifier (Table 2), we can see that our approach learns robust representations with limited data.

Comparison with Other Representations. We compare our representations with other baselines and summarize the results in Table 2. We use a mix of traditional approaches (MLP, SVM, and LR) and a deep neural network (DL). We also train a linear classifier on top of each of the representations to assess their robustness. As can be seen, the representations from MG-Net outperform all other baselines by a significant margin. In fact, a linear classifier on MG-Net features outperforms all other baseline representations that use deep neural networks. Our MG-Net features with a deep learning classifier achieve an average pathogen F-score of 63% and a host F-score of 98%, which are significantly higher than those of the baseline representations. Evaluation with 5-fold cross-validation (see supplementary material) corroborates the results. It is interesting to note that both the graph kernels and Seq2Vec use sequence read-level features, and our approach with only an autoencoder (Table 3) has comparable performance, indicating that the global structural features have a significant impact on the performance.

Ablation Studies. Finally, we systematically evaluate each component of the framework independently to identify its contribution. Specifically, we provide ablations of our approach using only image features (Autoencoder Only), image and structural features without the MG-Net architecture (Autoencoder + Structural Priors), and only structural features (from node2vec). From Table 3, we can see that the use of structural priors greatly improves the performance.
We remove the structural prior and use an autoencoder trained on only the pseudo-images (Autoencoder Only), and we use a late fusion strategy (Autoencoder + Structural Priors) to evaluate the structural reasoning module. While the performance of these variants is better than that of other baselines, the final MG-Net architecture outperforms all variations. Finally, we also vary the k-mer length and the stride length and see that the performance increases with the stride length, while reducing the k-mer length reduces the performance.

In this work, we presented MG-Net, one of the first efforts to offer a multi-modal perspective on metagenome analysis using the idea of co-occurrence statistics to construct pseudo-images. A novel, attention-based structural reasoning framework is introduced to perform multi-modal feature fusion, allowing for joint optimization over multiple modalities. Extensive experiments on real-world clinical data show that the learned representations outperform existing baselines by a significant margin and offer a way forward for metagenome classification under limited resources. We aim to leverage these results to build automated diagnosis and intervention pipelines for novel pathogen diseases with limited supervision.

[1] Graph embedding and unsupervised learning predict genomic sub-compartments from Hi-C chromatin interaction data
[2] A combined radio-histological approach for classification of low grade gliomas
[3] A query-by-example content-based image retrieval system of non-melanoma skin lesions. In: MICCAI International Workshop on Medical Content-Based Retrieval for Clinical Decision Support
[4] DeePAC: predicting pathogenic potential of novel DNA with reverse-complement neural networks
[5] A deep learning approach to pattern recognition for short DNA sequences
[6] Opportunities and obstacles for deep learning in biology and medicine
[7] Clinical metagenomics
[8] Microbe Finder (MiFi®): Implementation of an interactive pathogen detection tool in metagenomic sequence data
[9] node2vec: Scalable feature learning for networks
[10] The path to personalized medicine
[11] A metagenomics-based diagnostic approach for central nervous system infections in hospital acute care setting
[12] Deep residual learning for image recognition
[13] HumanNet v2: human gene networks for disease research
[14] Invariant information clustering for unsupervised image classification and segmentation
[15] MinIONQC: fast and simple quality control for MinION sequencing data
[16] Assessing the performance of the Oxford Nanopore Technologies MinION
[17] Generation of pseudo-CT using high-degree polynomial regression on dual-contrast pelvic MRI data
[18] DeepMicrobes: taxonomic classification for metagenomics with deep learning
[19] Network in network
[20] Assembly of long error-prone reads using de Bruijn graphs
[21] Sequencing technologies - the next generation
[22] A first look at the Oxford Nanopore MinION sequencer
[23] GRaDL: A framework for animal genome sequence classification with graph representations and deep learning
[24] Pseudo imaging. In: Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XII
[25] Growing self-organizing maps for metagenomic visualizations supporting disease classification
[26] Metagenome-based disease classification with deep learning and visualizations based on self-organizing maps
[27] Tracking brain deformations in time sequences of 3D US images
[28] PopPhy-CNN: a phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data
[29] U-Net: Convolutional networks for biomedical image segmentation
[30] Very deep convolutional networks for large-scale image recognition
[31] E-probe diagnostic nucleic acid analysis (EDNA): a theoretical approach for handling of next generation sequencing data for diagnostics
[32] Research on pseudo-CT imaging technique based on an ultrasound deformation field with binary mask in radiotherapy

This research was supported in part by the US Department of Agriculture (USDA) grants AP20VSD and B000C011. We thank Dr. Kitty Cardwell and Dr. Andres Espindola (Institute of Biosecurity and Microbial Forensics, Oklahoma State University) for providing access and assisting with use of the MiFi platform.