key: cord-1010065-ynpa8nfh
authors: Larkin, Alyse A.; Garcia, Catherine A.; Garcia, Nathan; Brock, Melissa L.; Lee, Jenna A.; Ustick, Lucas J.; Barbero, Leticia; Carter, Brendan R.; Sonnerup, Rolf E.; Talley, Lynne D.; Tarran, Glen A.; Volkov, Denis L.; Martiny, Adam C.
title: High spatial resolution global ocean metagenomes from Bio-GO-SHIP repeat hydrography transects
date: 2021-04-16
journal: Sci Data
DOI: 10.1038/s41597-021-00889-9
sha: c6d55350c211101bb3d36316cbabc0fdfe463b2d
doc_id: 1010065
cord_uid: ynpa8nfh

Detailed descriptions of microbial communities have lagged far behind physical and chemical measurements in the marine environment. Here, we present 971 globally distributed surface ocean metagenomes collected at high spatio-temporal resolution. Our low-cost metagenomic sequencing protocol produced 3.65 terabases of data, where the median number of base pairs per sample was 3.41 billion. The median distance between sampling stations was 26 km. The metagenomic libraries described here were collected as a part of a biological initiative for the Global Ocean Ship-based Hydrographic Investigations Program, or “Bio-GO-SHIP.” One of the primary aims of GO-SHIP is to produce high spatial and vertical resolution measurements of key state variables to directly quantify climate change impacts on ocean environments. By similarly collecting marine metagenomes at high spatio-temporal resolution, we expect that this dataset will help answer questions about the link between microbial communities and biogeochemical fluxes in a changing ocean.

www.nature.com/scientificdata

comparison, systematic and sustained biological measurements of the microbial component of ocean ecosystems have lagged far behind.
As part of the novel Bio-GO-SHIP sampling program, we present a dataset of 971 surface ocean metagenomes collected at high spatio-temporal resolution, with the aim of linking marine microbial traits and biodiversity more mechanistically to both chemical and hydrodynamic ecosystem fluxes. Samples were collected in the Atlantic, Pacific, and Indian Ocean basins (Fig. 1, Table 1). This effort has been supported by GO-SHIP, SOCCOM, the Plymouth Marine Laboratory Atlantic Meridional Transect (PML AMT), and three National Science Foundation (NSF) Dimensions of Biodiversity funded cruises (AE1319, BVAL46, and NH1418) (Table 2). Whereas the median distance between sampling stations was 709 km for Tara Oceans and 191 km for bioGEOTRACES, it is 26.5 km in the current Bio-GO-SHIP dataset (Fig. 2). In addition, the majority of Bio-GO-SHIP samples were collected every 4-6 hours, allowing for analysis of diel fluctuations in microbial composition and gene content [12].

We anticipate that our high-resolution sampling scheme will allow for a more detailed examination of the relationship between the broad range of geochemical parameters measured across the various cruises (Table 2) and microbial diversity and traits. Due to their rapid generation times and high diversity, microbial genomes integrate the impact of environmental change [13] and can be used as a 'biosensor' of subtle biogeochemical regimes that cannot be identified from physical parameters alone [12, 14-16]. Thus, the fields of microbial ecology and oceanography would benefit from coordinated, high-resolution measurements of marine 'omics products (i.e., metagenomes, metatranscriptomes, metaproteomes, etc.). This dataset provides an important example of the benefits of a sampling regime with high spatial and temporal resolution.
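The station-spacing comparison above rests on a median of distances between sampling stations. The text does not say how those distances were computed, so the following is only a minimal sketch under stated assumptions: great-circle (haversine) distances on a sphere with a mean radius of 6371 km, measured between consecutive stations along a single transect. The function names and the example coordinates are hypothetical and are not taken from the dataset.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_station_spacing_km(stations):
    """Median distance between consecutive stations.

    `stations` is a list of (latitude, longitude) tuples in sampling order.
    """
    if len(stations) < 2:
        raise ValueError("need at least two stations to compute a spacing")
    gaps = [haversine_km(*a, *b) for a, b in zip(stations, stations[1:])]
    return median(gaps)


# Hypothetical meridional transect (illustrative coordinates only).
transect = [(-30.0, 95.0), (-29.75, 95.0), (-29.5, 95.0), (-29.0, 95.0)]
print(f"median spacing: {median_station_spacing_km(transect):.1f} km")
```

A nearest-neighbour definition of spacing across all cruises would be an equally plausible reading of "median distance between sampling stations" and could give a different value.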
In addition, our data highlight the need for increased sampling of marine metagenomes in the Central and Western Pacific Ocean (Fig. 1), at latitudes poleward of 50°N and 50°S (Fig. 2), and below the euphotic zone. We hope and expect that these challenges will be addressed by the scientific community in the coming decade.

On all cruises, whole (i.e., no size fractionation) surface water was collected via either the Niskin rosette system (depth ~3-5 m) or the ship's circulating seawater system (depth ~7 m). Between 2 and 10 L of surface water (Table 1) was collected in triple-rinsed containers and gently filtered through a 0.22 μm pore size Sterivex filter (Millipore, Darmstadt, Germany) using sterilized tubing and a Masterflex peristaltic pump (Cole-Parmer, Vernon Hills, IL). DNA was preserved with 1620 μL of lysis buffer (4 mM NaCl, 750 μM sucrose, 50 mM Tris-HCl, 20 mM EDTA) and stored at −20 °C before extraction.

To extract DNA (modified from Bostrom et al. 2004 [17]), Sterivex filters were incubated with 180 μL lysozyme (3.5 nM) at 37 °C for 30 min, followed by an overnight incubation at 55 °C with 180 μL Proteinase K (0.35 nM) and 100 μL 10% SDS buffer. DNA was extracted from the Sterivex with 1000 μL TE buffer (10 mM Tris-HCl, 1 mM EDTA), precipitated in an ice-cold solution of 500 μL isopropanol (100%) and 1980 μL sodium acetate (3 mM, pH 5.2), pelleted by centrifugation for 30 min at 4 °C, and resuspended in TE buffer in a 37 °C water bath for 30 min. Next, DNA was purified using a genomic DNA Clean and Concentrator kit (Zymo Research Corp., Irvine, CA). Finally, DNA concentrations were quantified using a Qubit dsDNA HS Assay kit and Qubit fluorometer (ThermoFisher, Waltham, MA).
A total of 971 metagenomic libraries from 932 locations were prepared using Illumina-specific Nextera DNA transposase adapters and a Tagment […]

For quality control of the tagmentation products, dimers shorter than 150 nucleotides were removed using a buffered solution (1 M NaCl, 1 mM EDTA, 10 mM Tris-HCl, 44.4 M PEG-8000, 0.055% Tween-20 final concentration) of Sera-mag SpeedBeads (ThermoFisher, Waltham, MA). Metagenomic libraries were quantified using a Qubit dsDNA HS Assay kit (ThermoFisher, Waltham, MA) and a Synergy 2 Microplate Reader (BioTek, Winooski, VT). Libraries were then pooled at equimolar concentrations. Pooled library concentration was verified using a KAPA qPCR platform (Roche, Basel, Switzerland). Finally, dimer removal and read size distribution were checked using a 2100 Bioanalyzer high sensitivity DNA trace (Agilent, Santa Clara, CA).

Fifty-four samples were sequenced on two Illumina HiSeq 4000 lanes using 150 bp paired-end chemistry with 300 cycles (Illumina, San Diego, CA). A total of 666 samples were sequenced on three Illumina NovaSeq S4 flowcells, and an additional 251 samples were sequenced on a combination of S1 and SP flowcells using 150 bp paired-end chemistry with 300 cycles. The sequencing strategy produced a total of 2.42 × 10^10 reads, or 3.65 × 10^12 bp.

Table 2. Publicly available metadata variables collected on Bio-GO-SHIP cruises. These data may be updated as additional samples or stations are processed by the principal investigators of each dataset. Another 48 metadata variables not listed here were collected aboard the GO-SHIP, PML AMT, and NSF cruises and may be available upon request from CCHDO, BODC, or SOCCOM. *C13.5 is a partial occupation of the A13.5 GO-SHIP line that was aborted due to COVID-19. Thus, CTD casts corresponding to DNA collection were only performed at 8 stations.
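The equimolar pooling step described above can be sketched as follows. This is not the authors' procedure: it assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA to convert a mass concentration (ng/μL, e.g. from a Qubit reading) and a mean fragment length into molarity, and all function names, library names, and numbers are hypothetical.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # assumed average mass of one dsDNA base pair


def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA concentration in ng/uL to nM for a given mean fragment length."""
    return conc_ng_per_ul / (AVG_BP_MASS_G_PER_MOL * mean_fragment_bp) * 1e6


def equimolar_volumes_ul(libraries, fmol_per_library):
    """Volume (uL) of each library that contributes the same molar amount to a pool.

    `libraries` maps a library name to (concentration in ng/uL, mean fragment bp).
    Since 1 nM equals 1 fmol/uL, volume = target fmol / molarity in nM.
    """
    return {
        name: fmol_per_library / library_molarity_nM(conc, frag_bp)
        for name, (conc, frag_bp) in libraries.items()
    }


def pool_molarity_nM(libraries, volumes_ul):
    """Molarity of the combined pool, before any dilution or concentration step."""
    total_fmol = sum(
        library_molarity_nM(*libraries[name]) * vol for name, vol in volumes_ul.items()
    )
    return total_fmol / sum(volumes_ul.values())


# Hypothetical libraries: (Qubit concentration in ng/uL, mean fragment length in bp).
example_libraries = {"lib_A": (6.6, 400), "lib_B": (3.3, 400)}
volumes = equimolar_volumes_ul(example_libraries, fmol_per_library=50.0)
for name, vol in volumes.items():
    print(f"{name}: pipette {vol:.2f} uL")
print(f"pool molarity: {pool_molarity_nM(example_libraries, volumes):.2f} nM")
```

A mass-based estimate like this is only approximate, which is one reason a qPCR measurement of the final pool, as described in the text, is a useful independent check.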
The median number of bases per sample was 3.41 billion (range: 61.4 million to 21.4 billion). Prior to read trimming and quality filtering, 74% of all forward and reverse reads had an average quality score ≥Q25 (Table 1). The sequencing cost per bp was US$8.2 × 10^−9.

The majority of the samples here were collected under the auspices of the international GO-SHIP program and the national programs that contribute to it [21-24]. Links to publicly available metadata variables collected via CTD cast are provided in Table 2. All sequencing products associated with the Bio-GO-SHIP program can be found under BioProject ID PRJNA656268, hosted by the National Center for Biotechnology Information Sequence Read Archive (SRA) [30]. SRA accession numbers associated with each metagenome file are provided in Supplementary Table 1.

To ensure that no contamination of metagenomes occurred, negative controls were used. To ensure optimum paired-end short read sequencing, a 2100 Bioanalyzer high sensitivity DNA trace (Agilent, Santa Clara, CA) was used for each library to confirm that ~90% of the sequence fragments were above 200 bp and below 600 bp in length (Table 3). A Qubit (ThermoFisher, Waltham, MA) and a KAPA qPCR platform (Roche, Basel, Switzerland) were used to ensure that all pooled libraries were submitted for sequencing at a concentration > 15 nM.

Table 3. Sequencing run breakdown of Bio-GO-SHIP metagenomes including technical validation statistics. *Run 1 was concentrated via SpeedVac to 15 nM and bead size-selected such that 90% of fragments were between 200-600 bp by the UC Davis Genome Center DNA Technologies Core prior to sequencing. Final values for this run are not available.
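The sequencing totals reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text. It assumes that the read count refers to individual 150 bp reads before trimming, and that the per-bp cost applies uniformly to all samples. The variable names are illustrative.

```python
# Figures quoted in the text.
READ_LENGTH_BP = 150
TOTAL_READS = 2.42e10
TOTAL_BP_REPORTED = 3.65e12
MEDIAN_BP_PER_SAMPLE = 3.41e9
COST_PER_BP_USD = 8.2e-9
SAMPLES_PER_PLATFORM = {"HiSeq 4000": 54, "NovaSeq S4": 666, "NovaSeq S1/SP": 251}

# The per-platform sample counts should add up to the 971 libraries.
total_samples = sum(SAMPLES_PER_PLATFORM.values())

# Bases implied by the read count, assuming every read is a full 150 bp.
bp_from_reads = TOTAL_READS * READ_LENGTH_BP
relative_gap = abs(bp_from_reads - TOTAL_BP_REPORTED) / TOTAL_BP_REPORTED

# Costs implied by the per-bp price (assumed uniform across samples).
total_cost_usd = TOTAL_BP_REPORTED * COST_PER_BP_USD
median_sample_cost_usd = MEDIAN_BP_PER_SAMPLE * COST_PER_BP_USD

print(f"samples: {total_samples}")
print(f"bp implied by reads: {bp_from_reads:.3e} (gap vs. reported: {relative_gap:.2%})")
print(f"implied total sequencing cost: ${total_cost_usd:,.0f}")
print(f"implied cost of a median sample: ${median_sample_cost_usd:.2f}")
```

Under these assumptions the read count and the reported base total agree to within about 1%, which is consistent with rounding of the published figures. The implied cost is roughly US$30,000 for the whole dataset, or about US$28 for a median-sized sample.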
1. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific
2. Open science resources for the discovery and analysis of Tara Oceans data
3. Marine microbial metagenomes sampled across space and time
4. Understanding how microbiomes influence the systems they inhabit
5. Autonomous oceanographic sampling networks
6. Fifteen years of ocean observations with the global Argo array
7. Decline in global oceanic oxygen content during the past five decades
8. Ocean acidification: The other CO2 problem
9. Impact of anthropogenic CO2 on the CaCO3 system in the oceans
10. Warming of global abyssal and deep Southern Ocean waters between the 1990s and 2000s: Contributions to global heat and sea level rise budgets
11. The oceanic sink for anthropogenic CO2 from
12. Subtle biogeochemical regimes in the Indian Ocean revealed by spatial and diel frequency of Prochlorococcus haplotypes
13. Microdiversity shapes the traits, niche space, and biogeography of microbial taxa
14. Linking regional shifts in microbial genome adaptation with surface ocean biogeochemistry
15. Genomic adaptation of marine phytoplankton populations regulates phosphate uptake
16. Elucidating ecological complexity: Unsupervised learning determines global marine eco-provinces
17. Optimization of DNA extraction for quantitative marine bacterioplankton community analysis
18. Inexpensive multiplexed library preparation for megabase-sized genomes
19. Predictable molecular adaptation of coevolving Enterococcus faecium and lytic phage EfV12-phi1
20. Cervicovaginal microbiome composition is associated with metabolic profiles in healthy pregnancy
21. Bottle data from Cruise 33RR20160321, exchange version. CCHDO: CLIVAR and Carbon Hydrographic Data Office
22. Bottle data from Cruise 33RO20161119, exchange version. CCHDO: CLIVAR and Carbon Hydrographic Data Office
23. Bottle data from Cruise 33RO20180423, exchange version. CCHDO: CLIVAR and Carbon Hydrographic Data Office
24. Bottle data from Cruise 33RO20200321, exchange version. CCHDO: CLIVAR and Carbon Hydrographic Data Office
25. AMT28 (JR18001) CTD profiles (pressure, temperature, salinity, potential temperature, density, fluorescence, transmittance, downwelling PAR, dissolved oxygen concentration) calibrated and binned to 1 dbar. British Oceanographic Data Centre
26. Biological and Chemical Oceanography Data Management Office. Depth profile data from Bermuda Atlantic Time-Series Validation cruise 46 (BVAL46) in the Sargasso Sea from
27. Depth profile data from R/V Atlantic Explorer AE1319 in the NW Atlantic from
28. Depth profile data from R/V New Horizons New Horizons NH1418 in the tropical Pacific from

Financial support for this project was provided by the National Science Foundation (OCE-1046297, 1559002, 1848576, and 1948842 to ACM). LJU was supported by the National Institutes of Health (T32AI141346). LB and DLV were supported in part under the auspices of the Cooperative Institute for Marine and Atmospheric Studies (CIMAS), a cooperative institute of the University of Miami and NOAA (cooperative agreement NA10OAR4320143). The PML AMT is funded by the UK Natural Environment Research Council through its National Capability Longterm Single Centre Science Programme, Climate Linked Atlantic Sector Science (grant number NE/R015953/1). This study contributes to the international IMBeR project and is contribution number 357 of the AMT program.

coordinated sample collection, collected/processed samples, performed metagenomic sequencing, and compiled metadata. N.G. coordinated sample collection, collected/processed samples, and performed metagenomic sequencing. M.L.B. performed metagenomic sequencing and compiled metadata. J.A.L. collected/processed samples and performed metagenomic sequencing. L.J.U. processed samples and compiled metadata.

The authors declare no competing interests.

Supplementary information: The online version contains supplementary material available at https://doi.org/10.1038/s41597-021-00889-9.

Correspondence and requests for materials should be addressed to A.C.M.

Reprints and permissions information is available at www.nature.com/reprints.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.