key: cord-0302480-2f46aef8
authors: Ziemski, Michal; Adamov, Anja; Kim, Lina; Flörl, Lena; Bokulich, Nicholas A.
title: Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data using q2-fondue
date: 2022-03-25
journal: bioRxiv
DOI: 10.1101/2022.03.22.485322
sha: 3de971a058117fdc4d89afba5d218cf726648d01
doc_id: 302480
cord_uid: 2f46aef8

The volume of public nucleotide sequence data has blossomed over the past two decades, enabling novel discoveries via re-analysis, meta-analyses, and comparative studies for uncovering general biological trends. However, reproducible re-use and management of sequence datasets remains a challenge. We created the software plugin q2-fondue to enable user-friendly acquisition, re-use, and management of public nucleotide sequence (meta)data while adhering to open data principles. The software allows fully provenance-tracked programmatic access to and management of data from the Sequence Read Archive (SRA). Sequence data and accompanying metadata retrieved with q2-fondue follow a validated format, which is interoperable with the QIIME 2 ecosystem and its multiple user interfaces. To highlight the manifold capabilities of q2-fondue, we present several demonstration analyses using amplicon, whole genome, and shotgun metagenome datasets. These use cases demonstrate how q2-fondue increases analysis reproducibility and transparency from data download to final visualizations by including source details in the integrated provenance graph. We believe q2-fondue will lower existing barriers to comparative analyses of nucleotide sequence data, enabling more transparent, open, and reproducible conduct of meta-analyses. q2-fondue is a Python 3 package released under the BSD 3-clause license at https://github.com/bokulich-lab/q2-fondue.

The increasing volume of publicly available nucleotide sequence data is driving a revolution in the life sciences, by enabling comparative studies to discover generalizable trends that are often inaccessible or underpowered in an individual study. Individual studies addressing similar biological questions can differ in many technical aspects, including (but not limited to) specific experimental design, employed sequencing technologies, definitions of the examined target variables, and selection of potential covariates influencing the target. These inter-study variations can make individual study results biased (Serghiou et al., 2016) and even contradictory to one another (Ioannidis & Trikalinos, 2005). Meta-analysis allows the synthesis of findings from individual studies to reach a more complete understanding: identifying consistent and reproducible signatures across studies, and resolving causes of variation among study results (Gurevitch et al., 2018).

Meta-analyses of nucleotide sequencing-based studies have intensified within the past decade (see Figure 1), given the high potential of these data for reuse in comparative analyses. Meta-analyses of genome-wide association data have expanded our knowledge of human polygenic disorders and quantitative traits (Panagiotou et al., 2013). Comparative genomics has given insights into vertebrate genome evolution (Meadows & Lindblad-Toh, 2017) and the processes of genome function, speciation, selection and adaptation (Alföldi & Lindblad-Toh, 2013) [...]

Figure 1. [...] Displayed article counts were retrieved from PubMed on 2022/02/21 with the search query "(meta-analysis) AND ((omics) OR (genom*) OR (microbio*) OR (transcriptom*))" and a requirement for the article type to be a meta-analysis.

In order to provide consistent and reliable findings, meta-analyses must follow Findable, Accessible [...]

[...] be installed in a conda environment on any UNIX-based system as described in the installation instructions provided on the package website (https://github.com/bokulich-lab/q2-fondue). q2-fondue has been implemented as a QIIME 2 plugin, allowing use of QIIME 2's integrated data provenance tracking system, multiple user interfaces, and streamlined interoperability with downstream sequence analysis plugins.

An overview of q2-fondue is shown in Figure 2. Two separate q2-fondue actions allow easy access to the SRA database: get-sequences and get-metadata fetch per-sample sequence data and corresponding metadata (e.g., sample and run information), respectively. The get-all pipeline wraps both of these actions to simultaneously retrieve sequences and metadata for a list of SRA accessions. These three actions, get-sequences, get-metadata and get-all, require a single input file containing accession IDs of SRA runs or BioProjects to fetch. SRA run IDs allow direct interaction with the SRA databases, while BioProject IDs are first translated into corresponding run IDs using a chain of E-Direct utilities (Kans, 2013). An E-Search query is executed to look up provided IDs in the BioProject database, followed by an E-Link query finalized by an E-Fetch query to retrieve the linked run IDs.

Figure 2. Overview of q2-fondue methods. The get-sequences method can be used to fetch raw sequencing data (single- and/or paired-end), while get-metadata can download corresponding metadata. Both methods can be run simultaneously by using the get-all pipeline. Sequences obtained from multiple fetches can be combined using combine-seqs and multiple metadata artifacts can be merged with merge-metadata. Run and BioProject IDs can be retrieved from Zotero web library collections with the scrape-collection action.
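The handling of run and BioProject IDs described above can be sketched in a few lines of Python. This is a minimal illustration, not q2-fondue's actual code: the regular expressions assume the usual INSDC accession shapes (SRR/ERR/DRR for runs, PRJNA/PRJEB/PRJDB for BioProjects), all helper names are hypothetical, and the URLs only show how an E-Search, E-Link, E-Fetch chain against the public NCBI E-utilities endpoints could be parameterized. No request is sent.

```python
import re
from typing import Sequence
from urllib.parse import urlencode

# Base address of the public NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Accession shapes assumed for this sketch.
RUN_ID = re.compile(r"^[SED]RR\d+$")
BIOPROJECT_ID = re.compile(r"^PRJ[EDN][A-Z]\d+$")


def classify_accession(accession: str) -> str:
    """Return 'run', 'bioproject', or 'unknown' for one accession ID."""
    if RUN_ID.match(accession):
        return "run"
    if BIOPROJECT_ID.match(accession):
        return "bioproject"
    return "unknown"


def esearch_url(bioproject_id: str) -> str:
    """Step 1: look up the accession in the BioProject database."""
    params = {"db": "bioproject", "term": bioproject_id}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def elink_url(bioproject_uid: str) -> str:
    """Step 2: follow links from a BioProject UID to SRA records."""
    params = {"dbfrom": "bioproject", "db": "sra", "id": bioproject_uid}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def efetch_url(sra_uids: Sequence[str]) -> str:
    """Step 3: fetch the linked SRA records, which carry the run IDs."""
    params = {"db": "sra", "id": ",".join(sra_uids)}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

In such a scheme, IDs classified as runs would skip the three-step chain entirely, which mirrors the direct path for run IDs described in the text.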

All data-fetching actions support configurable parallelization to maximally reduce the processing time. The get-metadata method employs a multi-threading approach built into the Entrezpy modules (see the Metadata retrieval section), while get-sequences uses multiple processes and queues to coordinate the data download with its pre- and post-processing steps.
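The queue-based coordination mentioned above can be illustrated with a small producer/consumer sketch. It is a simplification under stated assumptions: threads stand in for the separate processes used by get-sequences, the `fetch` and `process` callables are hypothetical placeholders for the download and post-processing steps, and the function name is invented for this example.

```python
import queue
import threading


def run_pipeline(run_ids, fetch, process, n_workers=2):
    """Download in worker threads while one consumer post-processes results."""
    todo, fetched = queue.Queue(), queue.Queue()
    results, failed = {}, {}

    def downloader():
        while True:
            run_id = todo.get()
            if run_id is None:  # sentinel: no more work for this worker
                break
            try:
                fetched.put((run_id, fetch(run_id)))
            except Exception as err:  # record the failure and keep going
                failed[run_id] = str(err)

    def processor():
        while True:
            item = fetched.get()
            if item is None:  # sentinel: all downloads are finished
                break
            run_id, raw = item
            results[run_id] = process(raw)

    workers = [threading.Thread(target=downloader) for _ in range(n_workers)]
    consumer = threading.Thread(target=processor)
    for thread in workers + [consumer]:
        thread.start()
    for run_id in run_ids:
        todo.put(run_id)
    for _ in workers:
        todo.put(None)  # one sentinel per download worker
    for thread in workers:
        thread.join()
    fetched.put(None)  # tell the consumer that nothing else will arrive
    consumer.join()
    return results, failed
```

Because post-processing starts as soon as the first download lands in the queue, the two stages overlap instead of running back to back.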

The get-sequences action makes use of the sra-tools command-line toolkit (Leinonen, Sugawara, et al., 2011; https://github.com/ncbi/sra-tools) developed by NCBI. The prefetch tool is first used to reliably fetch SRA datafiles using the provided run IDs and the fasterq-dump utility is then executed to retrieve the corresponding sequences (single- or paired-end) in the FASTQ format. To follow QIIME 2's naming convention those sequences are then renamed using their accession IDs, compressed, and finally validated by QIIME 2's built-in type validation system.
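As a rough sketch of the fetch, convert, rename and compress sequence described above, the helpers below build the two sra-tools command lines and derive a target file name. Several points are assumptions made for illustration: the helper names are hypothetical, the Casava-style target pattern (`<run>_00_L001_R1_001.fastq.gz`) is a guess at a QIIME 2 compatible name, and the commands are only constructed, not executed.

```python
import gzip
import shutil
from pathlib import Path


def prefetch_cmd(run_id: str, out_dir: str) -> list:
    """Command that downloads the SRA data file for one run."""
    return ["prefetch", run_id, "-O", str(out_dir)]


def fasterq_dump_cmd(run_id: str, out_dir: str, threads: int = 1) -> list:
    """Command that converts a fetched run into FASTQ file(s)."""
    return ["fasterq-dump", run_id, "-O", str(out_dir), "-e", str(threads)]


def target_name(fastq_name: str) -> str:
    """Map 'SRR1.fastq' or 'SRR1_2.fastq' to an assumed Casava-style name."""
    stem = fastq_name[: -len(".fastq")]
    run_id, _, mate = stem.partition("_")
    read = mate if mate in ("1", "2") else "1"  # single-end files count as R1
    return f"{run_id}_00_L001_R{read}_001.fastq.gz"


def compress_and_rename(fastq_path: Path) -> Path:
    """Gzip a FASTQ file under its new name and remove the original."""
    target = fastq_path.with_name(target_name(fastq_path.name))
    with open(fastq_path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    fastq_path.unlink()
    return target
```

The command lists could be passed to `subprocess.run` on a system where sra-tools is installed.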

The action keeps track of any errors that occurred while fetching the sequences and performs available storage checks on every iteration to ensure no data is lost when space is exhausted during download. There are three output files generated by the get-sequences action: two QIIME artifacts corresponding to single- and paired-end reads, respectively, and one [...]

[...] EFetchResult and EFetchAnalyzer that work in tandem to request and parse metadata for the provided run IDs. The result is represented as a table where a single row corresponds to one SRA run and columns reflect all the metadata fields found in the obtained response. In order to keep track of different metadata levels (study, sample, experiment, and run), we introduced a set of cascading Python data classes to delineate the hierarchical relationship of the SRA metadata entries (Figure 3) and to preserve this structure in the final study metadata. Moreover, tight integration with QIIME 2's internal type validation system guarantees consistency of metadata generated by q2-fondue by ensuring presence of all required metadata fields, as specified by NCBI.
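The idea of cascading data classes that flatten into one row per run can be sketched as follows. Only the name SRAStudy appears in the source; the other class names, all field names, and the `to_rows` method are hypothetical and chosen purely to illustrate the study, sample, experiment, run hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SRARun:
    id: str
    custom_fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class SRAExperiment:
    id: str
    runs: List[SRARun] = field(default_factory=list)
    custom_fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class SRASample:
    id: str
    experiments: List[SRAExperiment] = field(default_factory=list)
    custom_fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class SRAStudy:
    id: str
    samples: List[SRASample] = field(default_factory=list)
    custom_fields: Dict[str, str] = field(default_factory=dict)

    def to_rows(self) -> List[Dict[str, str]]:
        """Flatten the hierarchy into one dictionary per run."""
        rows = []
        for sample in self.samples:
            for experiment in sample.experiments:
                for run in experiment.runs:
                    row = {
                        "run_id": run.id,
                        "experiment_id": experiment.id,
                        "sample_id": sample.id,
                        "study_id": self.id,
                    }
                    # Inner levels override outer ones on a name clash.
                    for level in (self, sample, experiment, run):
                        row.update(level.custom_fields)
                    rows.append(row)
        return rows
```

Each row inherits the fields of its parent experiment, sample and study, so the nesting is preserved in the flat table through the repeated ID columns.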

The q2-fondue package contains additional functions to simplify sequencing (meta)data retrieval and manipulation, particularly when multiple data fetches are necessary.

➔ get-all allows simultaneous download of sequences and related metadata.

➔ merge-metadata concatenates metadata tables obtained from several independent get-metadata runs and allows generation of a single, unified metadata artifact.

➔ combine-seqs merges sequences from multiple artifacts, obtained from several get-sequences runs (or from other external sources), into a single sequence artifact.

➔ scrape-collection retrieves SRA run and BioProject IDs from a Zotero web library collection (https://www.zotero.org) by using the pyzotero package (Hügel et al., 2019). This enables seamless workflows for collecting IDs of interest from a literature collection, automatically downloading the data, and processing downstream with q2-fondue and QIIME 2.

Figure 3. Structure of the SRA metadata data classes used by the get-metadata action. Each of the classes represents a different level of metadata organization and can contain other nested metadata objects. As all the objects are linked together, metadata of the entire study and all its children can be retrieved directly from the SRAStudy object.
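To make the merge-metadata step listed above more concrete, the sketch below concatenates metadata tables that do not share the same columns. It is an illustration only: the function name, the `run_id` key, the empty-string fill value, and the rule that later tables update earlier rows with the same run ID are all assumptions, not documented q2-fondue behavior.

```python
def merge_metadata(*tables):
    """Combine row lists keyed by 'run_id' into one table with unified columns."""
    columns, merged = [], {}
    for table in tables:
        for row in table:
            for column in row:
                if column not in columns:
                    columns.append(column)  # keep first-seen column order
            merged.setdefault(row["run_id"], {}).update(row)
    # Fill columns missing from a given row with an empty string.
    rows = [{c: r.get(c, "") for c in columns} for r in merged.values()]
    return columns, rows
```

Taking the union of columns keeps study-specific fields from every fetch, at the cost of empty cells for runs that never reported them.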

The q2-fondue plugin can be used to facilitate comparative analysis of any nucleotide sequence data and metadata available on the SRA. To demonstrate some example use cases, we used q2-fondue and QIIME 2 to analyze marker-gene, whole genome sequence, and shotgun metagenome data.

Any meta-analysis can be carried out using raw experimental data, its associated metadata, or a combination of both. To demonstrate the versatility of q2-fondue in all those scenarios, and seamless integration/interoperability with downstream bioinformatics tools, we performed three example use case meta-analyses using amplicon, whole genome, and shotgun metagenome sequencing data and related metadata. All three use cases exclusively employ QIIME 2 plugins to process received data, and illustrate how q2-fondue can immensely increase data analysis reproducibility and transparency by including details on the raw data fetching in the QIIME 2 provenance.

Use case 1: amplicon sequencing

[...] reported by those studies to fetch the corresponding raw metadata and sequencing data. This provided us with 350 sequence samples each annotated with 148 metadata features.

After performing filtering, normalization, and denoising steps on the raw 16S rRNA gene sequences (see Figure 4A for an overview of plugins and actions used throughout this use case), a total of 3,880 amplicon sequencing variants (ASVs) were identified for 330 samples.

The available metadata was used to define binned age groups. The distribution of samples per age group as well as the analyzed age range differ per study (Figure 4B). We further defined a binary health status which denotes whether the sample stems from a healthy or unhealthy infant (see Methods for more details). Across all studies, 194 unhealthy and 136 healthy infant samples were identified. Figure 4C displays the fraction of healthy infants in each of the three geographic locations covered by the selected studies.

Finally, we trained two Random Forest classifiers with 10-fold cross-validation on the processed microbiome sequence data to predict age group and health status of each sample, respectively. The classifiers were evaluated on the test set of each fold and revealed a better performance in predicting age groups (macro averaged AUC = 0.85, Figure 4D) than health status (macro averaged AUC = 0.58, Figure 4E).

Use case 2: whole genome sequencing

To illustrate how q2-fondue can be used as an entry-point to analysis of whole genome sequencing data we turned to one of the most rapidly growing datasets of recent years: the SARS-CoV-2 genome dataset. We used all of the pre-processed metadata obtained through the Nextstrain.org platform (Hadfield et al., 2018) to identify samples which have been deposited in the SRA. We subsampled genomes of three SARS-CoV-2 variants: Alpha, Delta and Omicron (according to Nextstrain's clade naming strategy and WHO's variant labeling convention). We then fetched the corresponding SRA metadata using q2-fondue, which was used to prepare our final list of genomes. To simplify the analysis and reduce technical variability, we focused only on samples sequenced using single-end reads on the Illumina NextSeq 550 platform. Following the quality control step, we used the sourmash tool to readily compare viral genomes to one another by computing their MinHash signatures (Ondov et al., 2016). The resulting distance matrix was then used to generate a t-SNE plot visualizing how sampled genomes group together. Figure 5B shows that the Omicron variant forms a separate cluster in the t-SNE space and the Alpha and Delta clades, while distinguishable from one another in the form of several smaller clusters, cannot be as clearly separated into two groups. Finally, we used k-nearest-neighbors classification to quantitatively compare genome MinHash signatures and predict SARS-CoV-2 clade membership (Figure 5C). We found that it was possible to classify the three SARS-CoV-2 variants with an accuracy of 93%. An overview of plugins and actions applied in this use case can be found in Figure 5A.

Use case 3: shotgun metagenome sequencing

We used the Tara Oceans expedition dataset (Tara Oceans Consortium Coordinators et al., 2015) to illustrate how geographic location included in sample metadata deposited in the SRA can be used to display sample properties, using q2-fondue and QIIME 2 (see Figure 6A for an overview of plugins and actions used throughout this use case). We fetched metadata for six BioProjects containing 1,049 ocean samples obtained through size fractionation followed by shotgun metagenome sequencing (Figure 6B-C). As geographical coordinates of every sample are included in the SRA metadata, we could directly draw an array of interactive maps visualizing various sample properties using the q2-coordinates plugin (N. Bokulich & Caporaso, 2018). As an example, Figure 6D illustrates sample temperatures across the globe. Moreover, we randomly selected 10 samples collected at two distinct locations and used the corresponding sequences to calculate and compare their MinHash signatures. Using PCoA of the resulting distance matrix, we could show that the samples can be separated by location when using only their genome hash signatures (Figure 6E). More interactive visualizations can be found in the Jupyter notebook accompanying this manuscript (see Supplementary material).

Integration with QIIME 2 ecosystem

Since q2-fondue is a QIIME 2 plugin, it tightly integrates with and benefits from the rest of the QIIME 2 ecosystem. Sequences obtained through the get-sequences action can be directly plugged into any other QIIME 2 action that operates on this data type (see Figure 7 for an overview of actions applied in this study). In addition to defining format checks for SRA metadata objects, q2-fondue has implemented transformer functions to allow the metadata downloaded through the get-metadata action to serve as input to any QIIME 2 action that requires sample metadata. Furthermore, integration with QIIME 2's built-in provenance tracking system ensures that data fetching from the SRA is also included in the provenance graph (stored directly in all data outputs), enabling researchers to track and completely reproduce the entire analysis pipeline from data download to final visualizations.

[...] bottlenecks. We developed q2-fondue to lower these hurdles, and to facilitate reproducible acquisition and management of metadata and nucleotide sequence datasets from the SRA (see Table 1 for a summary of the most important features). Its integration with the QIIME 2 framework offers complete provenance tracking of the entire process, multiple user interfaces, and thorough input/output data validation, allowing meta-analyses to be conducted in a reproducible manner. Furthermore, q2-fondue outputs can be directly used with a wide range of QIIME 2 plugins, offering the user smooth integration with any sequence-based analysis that is (or will become) available within the QIIME 2 ecosystem. Finally, q2-fondue's integration with QIIME 2 offers users unparalleled support through the QIIME 2 forum, an exchange platform between users and plugin developers (with a current total of 5,700 signed-up members).

Despite its ease of use, q2-fondue does not free users from their due diligence in checking the details of the extracted datasets against the accompanying publications, where mismatches with the obtained run metadata or sequences could be detected.

The q2-fondue demonstrations shown here represent only a few possible use cases for the software, and we envision many other possible applications for analysis of diverse nucleotide sequence data types.

The q2-fondue package remains under active development, and several additional functionality upgrades are planned in the future. As q2-fondue operates on large amounts of sequencing data we will introduce several performance-enhancing updates that will allow better management of free storage space available during download as well as streamline downloading large numbers of accession IDs to avoid multiple re-fetches.

While using run and BioProject accession IDs may cover the needs of a large group of users, we are planning to additionally enable retrieving data using other kinds of IDs, notably SRA's Study ID, such that even more studies deposited in the SRA can be downloaded using q2-fondue in a more flexible way.

q2-fondue's metadata retrieval action already greatly simplifies downloading metadata of multiple projects and formatting those as a single result table. Several additional functions are planned to assist with management and integration of diverse study and sample metadata.

Figure 7. Overview of q2-fondue integration with other QIIME 2 plugins and actions as applied in the three use cases presented in this study. This is only a limited demonstration of possible downstream uses for three different nucleotide sequence data types, not an exhaustive list.

Finally, to unlock the potential of sequencing data stored in and processed by other repositories we will add support for (meta)data retrieval from various other databases (e.g., MGnify (Mitchell et al., 2020)). Altogether, we hope that q2-fondue can become the tool of choice for interacting with SRA and other similar repositories, while at the same time seamlessly integrating with the whole QIIME 2 ecosystem, hence enabling a wide range of available analysis types.

Table 1.

Challenge: Plethora of accession ID types complicates retrieval of sequences/metadata.
Solution: Conversion between BioProject and run IDs is performed automatically. All possible accession IDs are recorded in the final metadata table.

Challenge: Potential data loss on space exhaustion when fetching large amounts of runs.
Solution: q2-fondue keeps track of available disk space and will abort without data loss when the amount of space is insufficient.

Challenge: Sequencing data requires preprocessing/name normalization before it can be used in downstream analyses.
Solution: q2-fondue takes care of renaming/standardizing all the files after retrieval.

Challenge: Merged datasets and subsequent data analysis steps are not always reproducible.
Solution: Tight integration with QIIME 2 ensures that every data fetching and analysis detail is recorded in provenance stored together with every single output.

Challenge: Diversity of metadata fetched from multiple studies complicates its application in subsequent analyses.
Solution: Metadata retrieved by q2-fondue is normalized into a single table with standardized columns.

Challenge: Network issues and other errors lead to data loss and require cumbersome, repeated data fetches.
Solution: Data retrieval can be automatically repeated in case of encountered errors. In case of repeated failures, all errors are reported and can be investigated by the user once the download is finished. No data loss occurs.

Challenge: Parallelization of custom SRA access scripts is complicated and time-consuming.
Solution: q2-fondue takes care of data retrieval/processing in a parallel way, making use of multiple threads and CPUs available on the user's system.
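The automatic repetition of failed fetches summarized above can be pictured with a small retry loop. This is a sketch under assumptions, not the plugin's implementation: the function name, the `retries` and `wait` parameters, and the `fetch` callable are all hypothetical.

```python
import time


def fetch_with_retries(run_ids, fetch, retries=2, wait=0.0):
    """Fetch every run, re-trying failures; return (results, remaining errors)."""
    done, errors = {}, {}
    pending = list(run_ids)
    for attempt in range(retries + 1):
        still_failing = []
        for run_id in pending:
            try:
                done[run_id] = fetch(run_id)
                errors.pop(run_id, None)  # clear an error from an earlier round
            except Exception as err:
                errors[run_id] = str(err)
                still_failing.append(run_id)
        pending = still_failing
        if not pending:
            break
        if attempt < retries:
            time.sleep(wait)  # pause before the next round of re-fetches
    return done, errors
```

Runs that succeed are never fetched again, and whatever is left in the error mapping after the last round can be reported to the user in one place.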

Redondoviridae, a family of small, circular DNA viruses of the human oro-respiratory tract that are associated with periodontitis and critical illness

Comparative genomics as a tool to understand evolution and disease

Toward unrestricted use of public genomic data

1,500 scientists lift the lid on reproducibility

Building Global Infrastructure for Data Sharing and Exchange Through the Research Data Alliance. D-Lib Magazine

Machine-learning tools for microbiome classification and regression

Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin

Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2

Entrezpy: A Python library to dynamically interact with the NCBI Entrez databases

DADA2: High-resolution sample inference from Illumina amplicon data

pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive (8:532)

Growth and Morbidity of Gambian Infants are Influenced by Maternal Milk Oligosaccharides and Infant Gut Microbiota

Meta-analysis and the science of research synthesis

Nextstrain: Real-time tracking of pathogen evolution

An algorithm for the principal component analysis of large data sets

Comparability and reproducibility of biomedical data

Matplotlib: A 2D Graphics Environment

Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials

Entrez Direct: E-utilities on the Unix Command Line

The Sequence Read Archive: A decade more of explosive growth

Experimenting with reproducibility: A case study of robustness in bioinformatics

& on behalf of the International Nucleotide Sequence Database Collaboration

The European Nucleotide Archive

& International Nucleotide Sequence Database Collaboration

The Fecal Microbial Community of Breast-fed Infants from Armenia and Georgia

Phylogenetically Novel Uncultured Microbial Cells Dominate Earth Microbiomes

Cutadapt removes adapter sequences from high-throughput sequencing reads

DNA Data Bank of Japan

Anemia in infancy is associated with alterations in systemic metabolism and microbial structure and function in a sex-specific manner: An observational study. The American

Data Structures for Statistical Computing in Python

Liberating field science samples and data

Dissecting evolution and disease using comparative vertebrate genomics

The metagenomics RAST server - A public resource for the automatic phylogenetic and functional analysis of metagenomes

The microbiome analysis resource in 2020

Mash: Fast genome and metagenome distance estimation using

Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life

Scikit-learn: Machine Learning in Python

Challenges and Opportunities of Open Data in Ecology

Field-wide meta-analyses of observational associations can map selective availability of risk factors and the impact of model specifications

Big Data: Astronomical or Genomical?

Tara Oceans Consortium Coordinators

Open science resources for the discovery and analysis of Tara Oceans data

seaborn: Statistical data visualization

The FAIR Guiding Principles for scientific data management and stewardship

Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications

iMicrobe: Tools and data-driven discovery platform for the microbiome sciences

A network approach to elucidate and prioritize microbial dark matter in microbial communities. The ISME

SRAdb: Query and use public next-generation sequencing data from within R

We thank Evan Bolyen (Northern Arizona University) for insightful discussions on metadata processing and working with the SRA and SRA Toolkit. We also thank Anton Lavrinienko (ETH Zürich) for his valuable comments on the manuscript.

The authors declare no existing competing interests.