key: cord-0001962-ybd8hi8y authors: Dutilh, Bas E title: Metagenomic ventures into outer sequence space date: 2014-12-15 journal: Bacteriophage DOI: 10.4161/21597081.2014.979664 sha: 61d0aa9a148a88e0354e36508c925eeb87a6b685 doc_id: 1962 cord_uid: ybd8hi8y

Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is pressure on researchers to publish and move on, so the unknown sequences are often left as they are, and conclusions are drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, the actual size of biological sequence space remains an open question. The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.
Metagenomics is the untargeted sequencing of genetic material isolated from communities of microorganisms and viruses. These communities may be derived from bioreactors or from environmental, clinical, or industrial samples; in short, from anywhere in our unsterile biosphere. The classical questions in metagenomics that are asked about the sampled microbial community are "Who is there?" and "What are they doing?" 1 Originally an approach to answer these classical questions, metagenomics as a field has made great progress in the past decade. Applications include the use of metagenomics for the discovery of novel genetic functionality, 2 for describing microbial ecosystems and tracking their variation, 3 in untargeted medical diagnostics and forensics, 4 and as a powerful tool to determine the genome sequences of rare, uncultivable microbes. 5 Powered by advances in next-generation sequencing technology, metagenomics has the potential to venture beyond the limits of currently explored sequence space by sampling environmental microbes and viruses at an unprecedented scale and resolution.

Quite literally, sequence space is defined as the multi-dimensional space of all possible nucleotide (or protein) sequences. 6 For sequences of length n, sequence space contains n dimensions, one per residue, each of which can take one of 4 (or 20, for proteins) states, giving a total volume of $\sum_{n} 4^{n}$ sequences when summed over all possible sequence lengths n. Evolution may have largely explored this space, 7 but it remains an open question how large the current biological sequence space is, i.e., the fraction occupied by extant life.
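To make the volume of sequence space described above concrete, it can be evaluated for a finite maximum sequence length. The following is a minimal sketch, not part of the original paper: the cutoff `max_length` and both function names are assumptions introduced here, since the text sums over all possible lengths without giving an upper bound, and sequences are counted from length 1.

```python
def sequence_space_size(max_length: int, alphabet_size: int = 4) -> int:
    """Count all sequences of length 1..max_length over the given alphabet.

    alphabet_size is 4 for nucleotides and 20 for proteins.
    max_length is an assumed cutoff; the text itself gives no upper bound.
    """
    if max_length < 0 or alphabet_size < 2:
        raise ValueError("max_length must be >= 0 and alphabet_size >= 2")
    # Direct evaluation of the sum: a^1 + a^2 + ... + a^N
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


def sequence_space_size_closed_form(max_length: int, alphabet_size: int = 4) -> int:
    """Same count using the geometric series formula a * (a^N - 1) / (a - 1)."""
    if max_length < 0 or alphabet_size < 2:
        raise ValueError("max_length must be >= 0 and alphabet_size >= 2")
    a = alphabet_size
    # The division is exact because (a^N - 1) is always divisible by (a - 1).
    return a * (a ** max_length - 1) // (a - 1)


if __name__ == "__main__":
    for length in (3, 10, 30):
        print(length, sequence_space_size(length))
```

Under these assumptions, nucleotide sequences of up to 10 residues already number 1,398,100, and each additional position multiplies the total by roughly 4.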
Figuratively, and within the context of this paper, "outer sequence space" is the remainder of this biological sequence space waiting to be explored by science.

Keywords: biological dark matter, crAssphage, human gut, human virome, metagenomics, metagenome assembly, unknowns

Metagenomics has traditionally addressed the 2 classical questions listed above by aligning the sequencing reads in metagenomic data sets to a reference database containing known, annotated sequences. This allows the taxonomic and functional diversity of the sampled microbes to be described in terms of existing knowledge, allowing for straightforward interpretation of the results. However, a persistent concern in the analysis of metagenomes has been the unknown fraction, consisting of the reads that cannot be annotated by using database searches. The level of unknowns can range up to 99% of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database. 8

Unknowns exist for 4 interrelated reasons. The first reason is technical. Due to limitations of some next-generation sequencing platforms and library preparation protocols, spurious sequences may be generated that do not reflect true biological molecules. These artificial sequences include artifacts due to the sequencing technology 9 and chimeras, i.e., sequences generated from separate genetic molecules derived from different organisms. Since chimeras frequently arise during PCR amplification, they are expected to be more abundant in environmental amplicon sequencing than in shotgun metagenomics, and can be detected using bioinformatic tools. 10

The second reason that unknowns exist is biological, as they reflect the enormous natural diversity of microorganisms that we are only beginning to unveil with metagenomics. This is both overwhelming and exciting, highlighting how much remains to be discovered in biology.
This genetic diversity has been referred to as biological "dark matter," 11, 12 and is especially pronounced in viral metagenomes. 8 This issue can only be resolved by expanding reference databases, as exemplified by recent studies of one of the most studied microbial ecosystems: the human gut. The first metagenomic snapshots of the microbiota in the human gut were taken from 2 healthy adults, and revealed a high interindividual diversity and many unknowns. 13 To a large extent, these unknowns were resolved when a reference catalog was created based on the sequences in the gut metagenomes themselves, decreasing the percentage of unknowns from ~85% to ~20%. 14 Moreover, subsequent large-scale sequencing efforts revealed that, in fact, many people share a similar intestinal flora, regardless of whether these similarities are viewed as discrete enterotypes 15 or as gradients. 16 These results illustrate how unknowns can be depleted by expanding the databases with appropriate reference sequences. This requires not only increased sequencing effort of phylogenetically diverse isolates 17 or single cells, 11 but also mining of draft genomes from metagenomes, 18 sampled from microbial environments around the globe. 19 Thus, by mapping the global sequence space, we can provide reassurance that at least some level of sampling saturation can be achieved. For viruses, and particularly for bacteriophages, efforts to provide a denser sampling of sequence space are still lacking.

The third reason that unknowns exist is methodological. Because the advances in DNA sequencing technology have greatly outpaced improvements in computer power, 20 bioinformatic approaches to analyze metagenomes often cut corners. For example, reference databases may be reduced to include only those references that are expected in the sample a priori.
Moreover, read annotation may be limited to identifying almost exact sequence matches, as this can be computed much faster than a permissive homology search that takes sequence variation into consideration. These issues lead to an inherent blind spot for discovering true novelty, such as sequences that are not expected in the sample, or organisms that have not been observed before. One way to at least partially resolve this issue is de novo assembly of the metagenome. Depending on the diversity of the sample, assembly can combine many short sequences (individual reads) into fewer, longer ones (assembled contigs). Reducing the number and increasing the length of the sequences allows homology searches to be performed with more sensitive, computationally more expensive algorithms such as translated homology searches or profile searches, leading to more specific annotation and improved biological interpretation. Moreover, larger and more comprehensive reference databases can be used, allowing unexpected hits to be found.

The fourth reason that unknowns exist is logistical. Most research projects that generate metagenomic sequencing datasets deposit the read files in large repositories, provide an accession number in the associated publication, and move on. It is not unlikely that many of these data sets, consisting of files sometimes gigabytes in size, are never looked at again. Thus, while a certain sequence may have been "seen" in a metagenome and is thus strictly no longer "dark matter," it will still not be recognized when it is observed again. Reidentification of this sequence would only be possible if the publishing researcher identified it as an interesting sequence in his or her (assembled) metagenome, and submitted it to a searchable database like GenBank. 21 Because GenBank maintains very high standards for the sequences it accepts, submission can be a tedious process that is rarely worthwhile for unknown metagenomic contigs.
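The assembly idea mentioned above, combining many short reads into fewer, longer contigs, can be illustrated with a toy greedy overlap merger. This is a sketch under strong assumptions and not the method of any particular assembler: the function names `overlap` and `greedy_assemble` and the `min_overlap` parameter are hypothetical, reads are assumed to be error-free and on the same strand, and repeats and uneven coverage are not modeled.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Toy illustration only: assumes error-free, same-strand reads.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "GGGGAAAA"]
    print(greedy_assemble(toy_reads))
```

In this toy input, three overlapping reads collapse into one longer contig while the unrelated read stays separate, which is the reduction in sequence number and gain in length that makes more sensitive homology searches affordable.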
An in-depth investigation of the unknowns is rarely within the scope of a research project, and those sequences are thus first ignored and later forgotten. This is a waste of valuable resources: time, money, and work. The metagenomes available in public databases should be better exploited and mined for common sequences. To facilitate this, it is critical that metadata annotations of the metagenomes include a detailed description of the samples and sequencing protocol. 22 Exploiting these datasets will allow us to create more comprehensive maps of sequence space, and greatly improve our understanding and interpretation of metagenomes.

In the short term, ignoring the unknowns can facilitate the interpretation of a metagenome. Because a taxonomic or functional description cannot be provided, the classical questions in metagenomics are left unanswered for the unknown fraction of the metagenome, and concentrating on the annotated sequences leads to a more straightforward answer. However, unexpected or novel sequences are easily overlooked, even if they represent highly abundant or widespread organisms. Thus, in the long term, stockpiling the unknown sequencing reads in poorly accessible bulk sequence repositories can severely slow down research, the discovery of novel species, and the charting of biological sequence space.

One striking example of a novel genome discovered among the unknown sequences is crAssphage, a bacteriophage whose genome recruited uniquely aligning sequencing reads from 73% of the 466 analyzed human gut metagenomes and accounted for a total of 1.68% of those metagenomic reads. 23 Like many bacteriophages, its genome sequence is highly divergent from everything that was present in the annotated part of the GenBank database, which is why it was not observed before.
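The two kinds of statistic quoted for crAssphage, the share of metagenomes with aligned reads and the share of all reads, can be computed from per-sample read counts. The sketch below is illustrative and does not reproduce the original analysis: the function name, the input layout, the `min_reads` threshold, and the choice to pool reads across all samples for the read share are assumptions made here.

```python
from typing import Dict, Tuple


def prevalence_and_read_share(
    counts: Dict[str, Tuple[int, int]], min_reads: int = 1
) -> Tuple[float, float]:
    """Summarize how widespread and how abundant a genome is.

    counts maps a sample id to (reads aligned to the genome, total reads).
    Returns (fraction of samples with at least `min_reads` aligned reads,
             fraction of all reads, pooled over samples, that aligned).
    """
    if not counts:
        raise ValueError("counts must contain at least one sample")
    n_positive = sum(1 for aligned, _ in counts.values() if aligned >= min_reads)
    total_aligned = sum(aligned for aligned, _ in counts.values())
    total_reads = sum(total for _, total in counts.values())
    if total_reads == 0:
        raise ValueError("total read count must be positive")
    return n_positive / len(counts), total_aligned / total_reads


if __name__ == "__main__":
    # Made-up counts for four hypothetical samples, for illustration only.
    toy_counts = {
        "sample_1": (50, 1000),
        "sample_2": (0, 2000),
        "sample_3": (10, 1000),
        "sample_4": (0, 1000),
    }
    prevalence, read_share = prevalence_and_read_share(toy_counts)
    print(f"prevalence: {prevalence:.0%}, pooled read share: {read_share:.2%}")
```

Ranking unannotated contigs by both numbers is one simple way to flag sequences that are widespread or abundant enough to deserve targeted characterization.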
It has been suggested that the unknown fraction of metagenomes is enriched for viral sequences, 8, 24 because viral genomes are thought to evolve more rapidly than the genomes of cellular organisms, allowing them to explore a larger region of sequence space in the same amount of time.

To summarize, unknowns are genetic sequences that are difficult to identify using standard methods, such as by alignment to an annotated reference database. Unknowns remain a persistent elephant in the room in most metagenomics research projects, and exist for technical, biological, methodological, and logistical reasons. The most promising option to resolve the unknowns is to create improved reference databases that chart biological sequence space, including the outer realms that remain unexplored by science (also known as dark matter). Besides sequencing reference strains or single cells, it may be expected that metagenomic sequencing, assembly, and binning will contribute greatly to improving these reference databases, for example by identifying sequences that are common to many metagenomes and prioritizing them for targeted characterization. Characterizing unknowns will be vital to fully exploit the increasingly available metagenomic data sets from all ecosystems, toward understanding the roles of microbes and viruses in the biosphere. The actual size of biological sequence space remains an open question, but the untargeted, shotgun nature of metagenomics makes it the most powerful tool to address this question.
Metagenomics: application of genomics to uncultured microorganisms
Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla
Human gut microbiome viewed across age and geography
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes
Natural selection and the concept of a protein space
How much of protein sequence space has been explored by life on Earth
Metagenomics and future perspectives in virus discovery
TagDust - a program to eliminate artifacts from next generation sequencing data
UCHIME improves sensitivity and speed of chimera detection
Insights into the phylogeny and coding potential of microbial dark matter
Scratching the surface of biology's dark matter
Metagenomic analysis of the human distal gut microbiome
A human gut microbial gene catalogue established by metagenomic sequencing
Enterotypes of the human gut microbiome
A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets
A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea
Genomes from metagenomics
Meeting report: the terabase metagenomics workshop and the vision of an Earth microbiome project
The pace and proliferation of biological technologies
The minimum information about a genome sequence (MIGS) specification
A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
UMARS: Un-MAppable Reads Solution

I thank my collaborators for their contributions in the crAssphage project, and the anonymous reviewers of this manuscript for valuable suggestions.