title: Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences
authors: Sokhansanj, Bahrad A.; Rosen, Gail L.
date: 2022-03-21
journal: mSystems
DOI: 10.1128/msystems.00035-22

Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, although such models can be predictive, deep learning typically produces “black box” models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.

SARS-CoV-2 has spent the first 2 years of the pandemic rapidly evolving in ways that have substantially affected its virulence, transmission, and ability to evade our immune responses (11). SARS-CoV-2 is an RNA virus, so its genome is prone to mutate, albeit at a rate mitigated by its large genome size and the proofreading function of its exoribonuclease (12). The most frequent mutations observed in coronaviruses are generally substitutions, although insertions and deletions are observed as well (13).
In some cases, insertions from other viral genomes may occur, and, in fact, it appears as though the SARS-CoV-2 genome includes an insertion from human RNA (14). In other human coronaviruses, the estimated mutation rate is around 3 × 10^-4 substitutions per site per year (15, 16). The amount of mutation observed during the COVID-19 pandemic has been even more substantial than expected (17). An early estimate of the SARS-CoV-2 mutation rate was 6 × 10^-4 substitutions per site per year (18). But the disease has spread widely around the world since then, and novel variants transmit more quickly, increasing the opportunities for the virus to mutate (19, 20).

The SARS-CoV-2 spike protein will continue to change in the future. Studies on another human coronavirus, HCoV-OC43, suggest that genetic drift plays a role in coronavirus adaptive evolution (21). One study estimates that as of July 2021, SARS-CoV-2 had only "explored" 31% of the potential space for spike gene variation, based on comparisons with related sarbecoviruses (22).

The first widely used tool for tracking SARS-CoV-2 genomic variation was the Nextstrain project, https://nextstrain.org. Nextstrain, originally developed as a general tool for viruses, was adapted to offer clade definitions for SARS-CoV-2 based on phylogenetic analysis (23). Phylogenetic tree reconstruction has been effective in inferring viral origins and tracing transmission changes but has been less useful for classifying genomes, because the virus can accumulate and drop mutations in parallel across clades and subclades (24). The Pango nomenclature (https://cov-lineages.org/), developed specifically for SARS-CoV-2, has largely supplanted Nextstrain clade definitions (25). New sequences are assigned to Pango classifications, called "lineages," using the random forest classification algorithm.
A new Pango lineage is defined when a sufficient number of viral sequences emerges with a phylogenetic dissimilarity from existing sequences above a set threshold (26). Particularly significant Pango lineages have been identified by the World Health Organization (WHO) as variants of concern (VOC), which are given Greek letter designations (27).

While Pango lineages appear clear and well-defined, the reality is that the genome is much more fluid. If we want to understand how the genome affects viral function, we cannot rely on traditional taxonomic categorization. As mutations recur, revert, and proliferate, taxonomy hits the limits of its utility (11). As an initial matter, changes to SARS-CoV-2 properties often implicate combinations of multiple mutations that emerge simultaneously and then sometimes revert in whole or in part as the virus continues to evolve (28, 29). For example, one frequent spike protein amino acid substitution, N501Y, has appeared and reverted contemporaneously in multiple clades and lineages, with no evidence of recombination (30). Simultaneous mutations can also have unpredictable, nonlinear effects, i.e., they can be synergistic, antagonistic, or fully independent (31). This complicates classical and Bayesian logistic regression methods for predicting fitness or protein function from mutations, as they assume that mutations of individual amino acids or bases act independently (32).

SARS-CoV-2 evolution is also highly nonlinear. Widespread lineages, such as Delta, have spawned complex sublineages with distinct immune evasion and virulence properties, which often share more genetically with distantly related lineages than with their most recent ancestor (33, 34). The increasingly complex evolutionary history of the virus stymies other proposed methods for genetically subtyping viral variants as well (35–37).
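To make the earlier point about synergistic, antagonistic, and independent mutations concrete, the following minimal Python sketch compares the effect of a double mutant with the sum of the two single-mutant effects. It is an illustration only: the function names `epistasis` and `classify_interaction` are hypothetical, the effect values are made-up numbers rather than measurements from any cited study, and the labels assume beneficial (positive) effects expressed on an additive scale such as log relative fitness.

```python
def epistasis(effect_a: float, effect_b: float, effect_ab: float) -> float:
    """Return the deviation of the double-mutant effect from additivity.

    Effects are assumed to be on an additive scale (for example, log fitness
    relative to the unmutated sequence). Zero means the two mutations act
    independently, which is what an additive regression model assumes.
    """
    return effect_ab - (effect_a + effect_b)


def classify_interaction(eps: float, tol: float = 1e-9) -> str:
    """Label an interaction between two beneficial mutations."""
    if eps > tol:
        return "synergistic"   # together they do more than the sum of the parts
    if eps < -tol:
        return "antagonistic"  # together they do less than the sum of the parts
    return "independent"


# Hypothetical single-mutant effects of 0.5 and 0.25 (illustrative values only)
for double_mutant_effect in (1.0, 0.75, 0.5):
    eps = epistasis(0.5, 0.25, double_mutant_effect)
    print(double_mutant_effect, classify_interaction(eps))
```

A model with only per-mutation terms predicts the "independent" case every time; capturing the other two cases requires interaction terms or a model that can learn them.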
Further complicating the picture, some immunocompromised individuals can have chronic infections lasting 6 months to a year (38). During long-term infection, a spike protein can emerge with multiple variations, which phylogenetic analysis identifies as "long branch" divergence from the phylogenetic tree (39). Some long-term patients may even be treated with convalescent plasma or antibodies, which may select for immune evasive mutations (40). The Omicron variant has such a long branch divergence, indicating that it may have emerged in an immunocompromised host or after incubating in a nonhuman host such as a mouse (41, 42).

How can we predict the virulence, fitness, antibody evasion, and other key properties of novel SARS-CoV-2 variants from complex, nonlinear changes in genetic sequence? Machine learning can tackle complex pattern recognition problems by training a model that can classify organisms or genes by phylogeny or phenotype based on features of their genetic sequences. For example, we can extract k-mer (short subsequence) frequencies or other combinations of bases/amino acids and use them as features to train classifiers such as naive Bayes classifiers (NBC), support vector machines (SVM), decision tree-based methods, and neural networks (43–49). Machine learning with k-mer features has been used for SARS-CoV-2 to identify genetic fingerprints of specific infections (50), classify variants (51, 52), and train a model to predict the pathogenicity of unknown viruses (53). Another approach is to build profile hidden Markov models (HMMs), which can identify taxonomic lineages and variants of viruses. HMMs have been used to align SARS-CoV-2 sequences and compare the SARS-CoV-2 spike protein to that of other coronaviruses (22, 54, 55).

Deep learning has emerged as an even more powerful and flexible tool to find patterns in large and complicated data sets (56–59).
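As a concrete illustration of the k-mer features described above, the following Python sketch converts a nucleotide sequence into a fixed-length vector of k-mer frequencies. It is a minimal sketch under stated assumptions, not the method of any cited study: the function name `kmer_frequencies`, the default `k = 3`, and the choice to skip windows containing ambiguity codes such as `N` are all illustrative.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 3, alphabet: str = "ACGT") -> list[float]:
    """Return normalized k-mer frequencies in a fixed order.

    The output has len(alphabet) ** k entries, ordered as generated by
    itertools.product over the alphabet (AAA, AAC, AAG, ... for k = 3).
    """
    sequence = sequence.upper()
    valid = set(alphabet)
    # Slide a window of length k along the sequence and count each k-mer.
    # Windows containing characters outside the alphabet (e.g., N) are skipped.
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    )
    total = sum(counts.values())
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# Toy example with a made-up sequence (not a real SARS-CoV-2 fragment)
features = kmer_frequencies("ACGTAC", k=2)
print(len(features))  # 16 possible 2-mers over A, C, G, T
```

Because every sequence maps to a vector of the same length regardless of genome length, vectors like these can be passed directly to standard classifiers such as the NBC, SVM, or tree-based methods mentioned above.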
Deep learning models use multiple layers of neural networks to automatically extract and transform features during training (56–58). We can borrow deep learning methods developed for natural language processing (NLP) to find patterns in sequence data, where the relationships among the bases and amino acids that make up genome and protein sequences are analogous to the semantic relationships between the words that make up sentences (60–63). For example, one group of researchers has used concepts from semantic processing, e.g., the frequency of correlated words, to identify potential mutagenic sites in viruses including SARS-CoV-2 (64). An emerging approach to deep sequence learning is to transform protein sequences into embeddings that reflect their semantic structure, using the BERT (bidirectional encoder representations from transformers) neural network architecture, which Google developed to handle natural language search (65–68). An example of this approach is k-means clustering of "ProtBERT" SARS-CoV-2 protein embeddings generated by pretraining a BERT model on millions of UniProt sequences, which can be used to identify mutational hot spots within the genome that may give rise to future variants (69).

A key goal for modeling is to predict the health risk of emerging variants before empirical data are available. To this end, our group has developed a deep learning model to predict patient outcomes for emerging sequence variants that takes into account patient demographics (70). Others are working to integrate sequence learning with computational protein structure models. For example, one project combines models of cell receptor binding and immune epitope alteration with transformer-based deep learning models to predict the fitness advantage of mutations (71).
Deep learning has also been used to identify the relationship between protein sequence and function using data from deep mutational scanning, an experimental technique for massively parallel functional analysis of protein sequence site mutations (72, 73). Using this approach, another project predicts the risk posed by emerging variants by using a neural network to predict infectivity and vaccine breakthrough in combination with protein structure and binding prediction to model antibody resistance (74).

Deep learning methods excel at identifying complex features within data that allow classification. But they have a major weakness. Deep learning relies on neural networks, and it is very hard to determine why a neural network makes a particular classification or prediction. Interpretable, or explainable, machine learning can fill this important gap (75, 76). Interpretable machine learning is particularly important in bioinformatics, since explaining a model's predictions is critical to justify making high-stakes clinical or research decisions based on machine learning predictions (77, 78). Accordingly, developers of deep learning approaches to SARS-CoV-2 should consider providing some functionality to interpret or explain predictions.

Analytical tools for interpretability in deep learning include examining neural network structure through relevance propagation, activation difference propagation, sensitivity analysis, and saliency map methods (79–81). Integrated gradients have been used to analyze RNA splicing models (82). An increasingly popular approach is the "attention" mechanism originally developed for NLP (83, 84). Attention can highlight important features in text processed by deep learning models (85–87). The amount of "attention" at a position in a sequence correlates with the weight put on that position in a trained model, where high attention at a position implies potential significance.
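The idea that attention assigns a weight to each sequence position can be sketched in a few lines of Python. The example below computes scaled dot-product attention weights, the general formulation used in transformer models, for a single query vector over a set of per-position key vectors. It is only a sketch: the function names `attention_weights` and `top_positions` and the toy vectors are hypothetical, and in a real model the query and key vectors would be learned during training rather than written by hand.

```python
import math


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Return softmax-normalized attention weights, one per sequence position."""
    scale = math.sqrt(len(query))
    # Score each position by the scaled dot product of its key with the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_positions(weights: list[float], n: int = 3) -> list[int]:
    """Return the n positions (0-based) with the highest attention weight."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n]


# Toy example: three sequence positions with made-up two-dimensional keys
weights = attention_weights([1.0, 0.0], [[0.0, 0.0], [4.0, 0.0], [1.0, 0.0]])
print(top_positions(weights, n=2))
```

The weights sum to 1 across positions, so plotting them along the sequence gives the kind of per-position profile that attention-based interpretation relies on.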
Architectures combining convolutional neural networks (CNNs) with attention have been used to identify sequence motifs for functional genomics, e.g., transcription factor binding site detection (88, 89). Another group generated predictive models of adverse drug reactions based on chemical structures by combining attention with a CNN for each chemical property and structural feature in the model (90). Our group has shown that attention in combination with a recurrent neural network-based sequence model can provide insight into taxonomic and phenotypic classification of microbial 16S rRNA sequences (91), as well as gene ontology classifications of protein sequences (92).

Recently, transformer-based architectures have emerged, like the aforementioned BERT (93). Transformers are built on multiple attention modules ("heads"), which could be used for interpretability (94). For example, one recent paper demonstrated how different attention heads attended to different aspects of a learning task to identify nucleotide motifs for promoter sequences (95). However, attention cannot be read directly out of transformers. Further processing steps are generally required to connect attention to specific linguistic features (96). Our group recently applied a self-attention layer after a transformer as a way to more readily extract and visualize attention across the sequence and applied it to SARS-CoV-2 (70).

An important caveat is that, based on comparing attention to empirical evidence, attention does not necessarily imply explanation, at least in the sense of explaining precisely why a prediction took place (97). Attention can only highlight features that the attention layer of the deep learning model weighted most heavily during training, so it may only weakly indicate the complete set of important features for a classification problem.

Finally, we highlight three important data limitations that researchers should address. First, as Fig.
1 shows, there are serious global inequities in sequencing data, with the overwhelming majority of sequences coming from Europe and North America. GISAID has encouraged data sharing from developing countries by trading restrictions on republishing sequence information for access to that information (98). But global sequencing resources are disparately available (99). Even within Europe and the United States, racial and regional disparities in sequencing found in other surveys (100) hamper SARS-CoV-2 sequencing as well.

Second, the task of interpreting sequencing data is complicated by insufficient sample metadata, making it difficult to understand, for example, how SARS-CoV-2 sequences affect patient outcomes. In GISAID, most sequences only have information about a patient's age or gender (if available) and the location where the sample was collected. As of 7 January 2022, a little over 270,000 (4%) of the nearly 6.9 million sequences have any metadata for patient outcomes, and many metadata entries are unintelligible. Sequencing projects should be encouraged to collect and curate as much information as possible about the sample and meet minimum information standards for sequence metadata (101).

Third, sequencing errors can lead to spurious results. Quality control is critical to make sure that low-frequency sequence variants are real (102). Sequences can pick up contaminants from other variants in the amplification process, leading to what appear to be recombinant variants but which are in fact simply artifacts (103).

G.L.R. received National Science Foundation (NSF) grants no. 1919691 and no. 2107108. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We gratefully acknowledge all data contributors, i.e., the authors and their originating laboratories responsible for obtaining the specimens and their submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, from which Fig. 1 was generated.

1. COVID-19, the first pandemic in the post-genomic era
2. SARS-CoV-2 mRNA vaccine design enabled by prototype pathogen preparedness
3. Next generation sequencing of SARS-CoV-2 genomes: challenges, applications and opportunities
4. GISAID's role in pandemic response
5. GISAID: global initiative on sharing all influenza data - from vision to reality
6. Data-driven analysis of amino acid change dynamics timely reveals SARS-CoV-2 variant emergence
7. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees
8. CoV-spectrum: analysis of globally shared SARS-CoV-2 data to identify and characterize new variants
9. COVID-19 Review Consortium. 2021. Pathogenesis, symptomatology, and transmission of SARS-CoV-2 through analysis of viral genomics and structure
10. COVID-19 Review Consortium. 2021. Identification and development of therapeutics for COVID-19
11. The biological and clinical significance of emerging SARS-CoV-2 variants
12. The variant gambit: COVID-19's next move
13. Viral mutation rates
14. Putative host origins of RNA insertions in SARS-CoV-2 genomes
15. Novel and emerging mutations of SARS-CoV-2: biomedical implications
16. Mosaic structure of human coronavirus NL63, one thousand years of evolution
17. Overwhelming mutations or SNPs of SARS-CoV-2: a point of caution
18. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
19. Risk of rapid evolutionary escape from biomedical interventions targeting SARS-CoV-2 spike protein
20. CMMID COVID-19 Working Group, COVID-19 Genomics UK (COG-UK) Consortium. 2021. Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England
21. Genetic drift of human coronavirus OC43 spike gene during adaptive evolution
22. Unique protein features of SARS-CoV-2 relative to other sarbecoviruses
23. Nextstrain: real-time tracking of pathogen evolution
24. Genomics UK (COG-UK). 2021. The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages
25. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
26. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool
27. Editorial: revised World Health Organization (WHO) terminology for variants of concern and variants of interest of SARS-CoV-2
28. Reduced sensitivity of SARS-CoV-2 variant Delta to antibody neutralization
29. SARS-CoV-2 variant evolution in the United States: high accumulation of viral mutations over time likely through serial founder events and mutational bursts
30. Spreading of a new SARS-CoV-2 N501Y spike variant in a new lineage
31. SARS-CoV-2 Genomic Surveillance Initiative. 2021. Impact of circulating SARS-CoV-2 variants on mRNA vaccine-induced immunity
32. Analysis of 2.1 million SARS-CoV-2 genomes identifies mutations associated with transmissibility
33. Spike protein evolution in the SARS-CoV-2 Delta variant of concern: a case series from Northern Lombardy
34. Breakthrough infections of E484K-harboring SARS-CoV-2 Delta variant
35. Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization
36. Co-mutation modules capture the evolution and transmission patterns of SARS-CoV-2
37. Pitfalls of barcodes in the study of worldwide SARS-CoV-2 variation and phylodynamics
38. Year-long COVID-19 infection reveals within-host evolution of SARS-CoV-2 in a patient with B cell depletion
39. Intrahost evolution during SARS-CoV-2 prolonged infection
40. Emergence of multiple SARS-CoV-2 antibody escape variants in an immunocompromised host undergoing convalescent plasma treatment
41. Evidence for a mouse origin of the SARS-CoV-2 Omicron variant
42. Where did 'weird' Omicron come from? Science
43. Metagenome fragment classification using N-mer frequency profiles
44. Machine learning applications in genetics and genomics
45. Large-scale machine learning for metagenomics sequence classification
46. Correct machine learning on protein sequences: a peer-reviewing perspective
47. Machine learning for detection of viral sequences in human metagenomic datasets
48. Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses
49. Identify DNA-binding proteins through the extreme gradient boosting algorithm
50. Profiling SARS-CoV-2 mutation fingerprints that range from the viral pangenome to individual infection quasispecies
51. Classifying COVID-19 variants based on genetic sequences using deep learning models
52. A k-mer based approach for SARS-CoV-2 variant identification
53. COVID-DeepPredictor: recurrent neural network to predict SARS-CoV-2 and other pathogenic viruses
54. COVID-align: accurate online alignment of hCoV-19 genomes using a profile HMM
55. Rational design of profile hidden Markov models for viral classification and discovery
56. Deep learning
57. A primer on deep learning in genomics
58. Deep learning in next-generation sequencing
59. Unsupervised protein embeddings outperform handcrafted sequence and structure features at predicting molecular function
60. Deep learning: new computational modelling techniques for genomics
61. Representation learning applications in biological sequence analysis
62. Protein embeddings and deep learning predict binding residues for various ligand classes
63. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes
64. Learning the language of viral evolution and escape
65. BERT: pre-training of deep bidirectional transformers for language understanding
66. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
67. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing
68. ProteinBERT: a universal deep-learning model of protein sequence and function
69. Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon entropy and K-means clustering
70. Interpretable and predictive deep modeling of the SARS-CoV-2 spike protein sequence
71. Early computational detection of potential high risk SARS-CoV-2 variants
72. Deep mutational scanning: a new style of protein science
73. Neural networks to learn protein sequence-function relationships from deep mutational scanning data
74. Omicron (B.1.1.529): infectivity, vaccine breakthrough, and antibody resistance
75. Explainable AI: a review of machine learning interpretability methods
76. Definitions, methods, and applications in interpretable machine learning
77. Incorporating machine learning into established bioinformatics frameworks
78. Interpretable machine learning for genomics
79. Methods for interpreting and understanding deep neural networks
80. Learning important features through propagating activation differences
81. Deep inside convolutional networks: visualising image classification models and saliency maps
82. Enhanced integrated gradients: improving interpretability of deep learning models using splicing codes as a case study
83. Neural machine translation by jointly learning to align and translate
84. Show, attend and tell: neural image caption generation with visual attention
85. A neural attention model for abstractive sentence summarization
86. Hierarchical attention networks for document classification
87. Attention-based bidirectional long short-term memory networks for relation classification
88. Genetic architect: discovering genomic structure with learned neural architectures. arXiv 1605
89. Deep motif: visualizing genomic sequence classifications
90. Predicting adverse drug reactions through interpretable deep learning framework
91. Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network
92. Visualizing and annotating protein sequences using a deep neural network
93. Attention is all you need
94. BertViz: a tool for visualizing multihead self-attention in the BERT model
95. Explainability in transformer models for functional genomics
96. Attention is not only a weight: analyzing transformers with vector norms
97. Attention is not explanation
98. Scientists call for fully open sharing of coronavirus genome data
99. Danish Covid-19 Genome Consortium. 2021. Global disparities in SARS-CoV-2 genomic surveillance
100. Racial/ethnic disparities in genomic sequencing
101. COVID-19 pandemic reveals the peril of ignoring metadata standards
102. Assessment of SARS-CoV-2 genome sequencing: quality criteria and low-frequency variants
103. Synthetic DNA spike-ins (SDSIs) enable sample tracking and detection of inter-sample contamination in SARS-CoV-2 sequencing workflows
104. A cross-country database of COVID-19 testing