Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets

Nicholas D. Youngblut*,1, Ruth E. Ley1

1 Department of Microbiome Science, Max Planck Institute for Developmental Biology, Max Planck Ring 5, 72076 Tübingen, Germany
* Corresponding author: Nicholas Youngblut (nicholas.youngblut@tuebingen.mpg.de)

Running title: Struo2 builds databases faster
Key words: metagenome, database, profiling, GTDB

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430604; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

Abstract

Mapping metagenome reads to reference databases is the standard approach for assessing microbial taxonomic and functional diversity from metagenomic data. However, public reference databases often lack recently generated genomic data such as metagenome-assembled genomes (MAGs), which can limit the sensitivity of read-mapping approaches. We previously developed the Struo pipeline to provide a straightforward method for constructing custom databases; however, the pipeline does not scale well with the ever-increasing number of publicly available microbial genomes. Moreover, the pipeline does not allow for efficient database updating as new data are generated. To address these issues, we developed Struo2, which is >3.5-fold faster than Struo at database generation and can also efficiently update existing databases.
We also provide custom Kraken2, Bracken, and HUMAnN3 databases that can be easily updated with new genomes and/or individual gene sequences. Struo2 enables feasible database generation for continually increasing large-scale genomic datasets.

Availability:
● Struo2: https://github.com/leylabmpi/Struo2
● Pre-built databases: http://ftp.tue.mpg.de/ebio/projects/struo2/
● Utility tools: https://github.com/nick-youngblut/gtdb_to_taxdump

Results

Metagenome profiling involves mapping reads to reference sequence databases and is the standard approach for assessing microbial community taxonomic and functional composition via metagenomic sequencing. Most metagenome profiling software includes "standard" reference databases. For instance, the popular HUMAnN pipeline includes multiple databases for assessing both taxonomy and function from read data (Franzosa et al., 2018). Similarly, Kraken2 includes a set of standard databases for taxonomic classification of specific clades (e.g., fungi or plants) or all taxa (Wood et al., 2019). While such standard reference databases provide a crucial resource for metagenomic data analysis, they may not be optimal for the needs of every researcher. For example, a custom database that includes newly generated MAGs can increase the percentage of reads mapped to references (Youngblut et al., 2020). The process of making custom reference databases is often complicated and requires substantial computational resources, which led us to create Struo for straightforward custom metagenome profiling database generation (de la Cuesta-Zuluaga et al., 2020). However, Struo requires ~2.4 CPU hours per genome, which would necessitate >77,900 CPU hours (>9.1 years) if including one genome for each of the 31,911 species in Release 95 of the Genome Taxonomy Database (GTDB) (Parks et al., 2018).

Struo2 generates Kraken2 and Bracken databases similarly to Struo (Lu et al., 2017; Wood et al., 2019), but the two pipelines diverge substantially at the time-consuming gene annotation step required for HUMAnN database construction. Struo2 performs gene annotation by clustering all gene sequences of all genomes using the mmseqs2 linclust algorithm, and then annotating each gene cluster representative via mmseqs2 search (Figure 1A; Supplemental Methods) (Steinegger and Söding, 2017, 2018). In contrast, Struo annotates all non-redundant genes of each genome with DIAMOND (Buchfink et al., 2015). Struo2 uses Snakemake and conda, which allow for easy installation of all dependencies and simplified scaling to high performance computing systems (Köster and Rahmann, 2012).

Benchmarking on genome subsets from the GTDB showed that Struo2 requires ~0.67 CPU hours per genome versus ~2.4 for Struo (Figure 1B). Notably, Struo2 annotates slightly more genes than Struo, possibly due to the sensitivity of the iterative search algorithm used by mmseqs search (Figure 1C). The use of mmseqs2 allows for efficient updating of databases with new genomes and/or individual gene sequences via mmseqs clusterupdate (Figure S1); we show that this approach saves 15-19% of the CPU hours relative to generating a database from scratch (Figure 1D).

We used Struo2 to create publicly available Kraken2, Bracken, and HUMAnN3 custom databases from Release 95 of the GTDB (see Supplemental Methods). We will continue to publish these custom databases as new GTDB versions are released.
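
As a rough check on the scaling argument, the per-genome costs reported above can be extrapolated to the 31,911 species of GTDB Release 95. The following sketch uses only the rounded rates quoted in the text (~2.4 and ~0.67 CPU hours per genome) and assumes that cost grows linearly with the number of genomes. Its totals are therefore approximate and will differ somewhat from the >77,900 CPU hour estimate given earlier, which presumably rests on unrounded measurements. The function and variable names are illustrative only.

```python
# Back-of-the-envelope extrapolation of database build cost.
# Assumption: CPU time scales linearly with the number of genomes.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of a single CPU running continuously


def total_cpu_hours(n_genomes: int, cpu_hours_per_genome: float) -> float:
    """Return the estimated total CPU hours for n_genomes."""
    return n_genomes * cpu_hours_per_genome


n_species = 31_911   # species in GTDB Release 95 (from the text)
struo_rate = 2.4     # ~CPU hours per genome for Struo (rounded, from the text)
struo2_rate = 0.67   # ~CPU hours per genome for Struo2 (rounded, from the text)

struo_total = total_cpu_hours(n_species, struo_rate)
struo2_total = total_cpu_hours(n_species, struo2_rate)
speedup = struo_rate / struo2_rate

print(f"Struo:  {struo_total:,.0f} CPU hours "
      f"({struo_total / HOURS_PER_YEAR:.1f} single-CPU years)")
print(f"Struo2: {struo2_total:,.0f} CPU hours "
      f"({struo2_total / HOURS_PER_YEAR:.1f} single-CPU years)")
print(f"Speed-up: {speedup:.2f}-fold")
```

With these rounded inputs the ratio of the two rates is about 3.6, in line with the >3.5-fold speed-up stated in the abstract.
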
The databases are available at http://ftp.tue.mpg.de/ebio/projects/struo2/. We also created a set of utility tools for generating NCBI taxdump files from the GTDB taxonomy and mapping between the NCBI and GTDB taxonomies. The taxdump files are utilized by Struo2, but these tools can be used more generally to integrate the GTDB taxonomy into existing pipelines designed for the NCBI taxonomy (available at https://github.com/nick-youngblut/gtdb_to_taxdump).

Figure 1. Struo2 can build databases faster than Struo and can efficiently update the databases. A) A general outline of the Struo2 database creation algorithm. Cylinders are input or output files, squares are processes, and right-tilted rhomboids are intermediate files. The largest change from Struo is the utilization of mmseqs2 for clustering and annotation of genes. B) Benchmarking the number of CPU hours required for Struo and Struo2, depending on the number of input genomes. C) The number of genes annotated with a UniRef90 identifier. D) The percentage of CPU hours saved via the Struo2 database updating algorithm versus de novo database generation. The original database was constructed from 1000 genomes. For B) and D), the grey regions represent 95% confidence intervals.

Data availability

Struo2 is available at https://github.com/leylabmpi/Struo2, the pre-built databases can be found at http://ftp.tue.mpg.de/ebio/projects/struo2/, and utility tools are located at https://github.com/nick-youngblut/gtdb_to_taxdump.
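
The cluster-then-annotate strategy outlined in Figure 1A can be illustrated with a toy example. The sketch below is not the Struo2 implementation: exact sequence identity stands in for the similarity clustering performed by mmseqs2 linclust, a dictionary lookup stands in for the mmseqs2 search against a reference database, and all gene names, sequences, and identifiers are made up. It only aims to show why annotating one representative per cluster reduces the number of expensive searches.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def cluster_genes(genes: Dict[str, str]) -> Dict[str, List[str]]:
    """Toy clustering: group gene IDs that share an identical sequence.

    Returns a mapping of representative gene ID -> member gene IDs.
    (A real clustering tool groups by sequence similarity, not identity.)
    """
    by_sequence: Dict[str, List[str]] = defaultdict(list)
    for gene_id, sequence in genes.items():
        by_sequence[sequence].append(gene_id)
    # The first member seen is used as the cluster representative.
    return {members[0]: members for members in by_sequence.values()}


def annotate_via_representatives(
    genes: Dict[str, str],
    annotate: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Annotate one representative per cluster and propagate to all members."""
    annotations: Dict[str, Optional[str]] = {}
    for representative, members in cluster_genes(genes).items():
        label = annotate(genes[representative])  # one search per cluster
        for member in members:
            annotations[member] = label
    return annotations


# Hypothetical reference database and genes from two genomes.
reference = {
    "MKTAYIAK": "UniRef90_hypothetical_1",
    "MSLLNQ": "UniRef90_hypothetical_2",
}
lookups: List[str] = []  # records every (expensive) search performed


def lookup(sequence: str) -> Optional[str]:
    lookups.append(sequence)
    return reference.get(sequence)


genes = {
    "genomeA_gene1": "MKTAYIAK",
    "genomeA_gene2": "MSLLNQ",
    "genomeB_gene1": "MKTAYIAK",  # same sequence as genomeA_gene1
    "genomeB_gene2": "MPPRTW",    # no match in the reference
}

annotations = annotate_via_representatives(genes, lookup)
print(f"{len(genes)} genes annotated with {len(lookups)} searches")
```

In this toy case the four genes fall into three clusters, so only three searches are needed. The saving grows with the redundancy of the gene set, which is the property the clustering step is meant to exploit when many genomes are pooled.
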

Acknowledgements

This study was supported by the Max Planck Society. We thank Albane Ruaud, Liam Fitzstevens, Jacobo de la Cuesta-Zuluaga, and Jillian Waters for providing helpful comments on an earlier version of this manuscript.

References

Buchfink,B. et al. (2015) Fast and sensitive protein alignment using DIAMOND. Nat. Methods, 12, 59–60.
de la Cuesta-Zuluaga,J. et al. (2020) Struo: a pipeline for building custom databases for common metagenome profilers. Bioinformatics, 36, 2314–2315.
Franzosa,E.A. et al. (2018) Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods, 15, 962–968.
Köster,J. and Rahmann,S. (2012) Snakemake--a scalable bioinformatics workflow engine. Bioinformatics, 28, 2520–2522.
Lu,J. et al. (2017) Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci., 3, e104.
Parks,D.H. et al. (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol., 36, 996–1004.
Steinegger,M. and Söding,J. (2018) Clustering huge protein sequence sets in linear time. Nat. Commun., 9, 2542.
Steinegger,M. and Söding,J. (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol., 35, 1026–1028.
Wood,D.E. et al. (2019) Improved metagenomic analysis with Kraken 2. Genome Biol., 20, 257.
Youngblut,N.D. et al. (2020) Large-Scale Metagenome Assembly Reveals Novel Animal-Associated Microbial Genomes, Biosynthetic Gene Clusters, and Other Genetic Diversity. mSystems, 5.