key: cord-0003788-8pr5ispb authors: Warwick-Dugdale, Joanna; Solonenko, Natalie; Moore, Karen; Chittick, Lauren; Gregory, Ann C.; Allen, Michael J.; Sullivan, Matthew B.; Temperton, Ben title: Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands date: 2019-04-25 journal: PeerJ DOI: 10.7717/peerj.6800 sha: b386f0bbf59b80952e0320feacb36572c6c3b9ff doc_id: 3788 cord_uid: 8pr5ispb Marine viruses impact global biogeochemical cycles via their influence on host community structure and function, yet our understanding of viral ecology is constrained by limitations in host culturing and a lack of reference genomes and ‘universal’ gene markers to facilitate community surveys. Short-read viral metagenomic studies have provided clues to viral function and first estimates of global viral gene abundance and distribution, but their assemblies are confounded by populations with high levels of strain evenness and nucleotide diversity (microdiversity), limiting assembly of some of the most abundant viruses on Earth. Such features also challenge assembly across genomic islands containing niche-defining genes that drive ecological speciation. These populations and features may be successfully captured by single-virus genomics and fosmid-based approaches, at least in abundant taxa, but at considerable cost and technical expertise. Here we established a low-cost, low-input, high throughput alternative sequencing and informatics workflow to improve viral metagenomic assemblies using short-read and long-read technology. The ‘VirION’ (Viral, long-read metagenomics via MinION sequencing) approach was first validated using mock communities where it was found to be as relatively quantitative as short-read methods and provided significant improvements in recovery of viral genomes. We then then applied VirION to the first metagenome from a natural viral community from the Western English Channel. In comparison to a short-read only approach, VirION: (i) increased number and completeness of assembled viral genomes; (ii) captured abundant, highly microdiverse virus populations, and (iii) captured more and longer genomic islands. Together, these findings suggest that VirION provides a high throughput and cost-effective alternative to fosmid and single-virus genomic approaches to more comprehensively explore viral communities in nature. For short, hybrid and long-read polished assemblies, nucleotide diversity was calculated as follows: High quality short reads from the Western English Channel were mapped back to viral contigs and reads mapping at <95% identity to any viral contig were removed to evaluate nucleotide diversity within established parameters for viral populations (Brum et al., 2015) . Contigs with >10-fold coverage across 70% of their whole genome were retained for robust calculation of the ratio of non-synonymous to synonymous polymorphism rates (Schloissnig et al., 2013; Brum et al., 2015) . Single nucleotide polymorphisms (SNPs) were identified using mpileup and BCFtools (https://samtools.github.io/bcftools/bcftools.html) and those with a quality score >=30, represented by at least 4 reads and comprising >1% of the base pair coverage for that position were considered true SNPs. SNP frequencies across all genomes were rarefied by subsampling to 10⨉ coverage proportionate to the frequency of different SNPs per site while maintaining SNPs linkages. Parameters were chosen for conservative estimates, in accordance with previously published work (Schloissnig et al., 2013) . Observed nucleotide diversity (⨉⨉) (Nei & Li, 1979) was estimated both per contig (median across the length of the contig) and at a per-base level. We identified genomic islands in viral contigs from short-read only assembly, hybrid assembly and polished long-read assembly of VirION reads, as described previously (Mizuno, Ghai & Rodriguez-Valera, 2014) , with parameter optimisation to account for increased error in long reads Short read data from the Western English Channel was mapped back against the viral contigs > 10kb using bowtie2 (Langmead & Salzberg, 2012) and samtools (Li et al., 2009) . BAM files were filtered using BamM (http://github.com/ecogenomics/BamM) to remove reads mapping at nucleotide identities ranging from 92-98%, to assess any impact of increased sequencing error in long-read assemblies. Contigs with a median per-base coverage of < 5 or those with a Reads per kb of genome per Gbp of reads mapped (RPKG) of < 1 were identified with BamM and removed from analysis. Genomic islands were defined as regions where the median coverage of a 500 bp sliding window was < 20% of the median coverage of the contig (Mizuno, Ghai & Rodriguez-Valera, 2014) . Putative genomic islands were excluded if they were within 500 bp of the end of a contig. If two genomic islands were found within 500 bp of each other, they were combined into a single genomic island. Lengths of genomic islands and density of genomic islands per contig were calculated for each assembly type. Effect size of different assembly types on genomic island length and density and associated 95% confidence intervals (CI) were calculated from bootstrapped medians (Cumming, 2014). As each long read is amplified from a single template strand, it is possible to investigate how conserved the protein content is within genomic islands across viral populations by identifying reads that have different protein content, but which span the same genomic island. First, individual long reads were error corrected with short reads as described previously. Corrected reads with a minimum of 5-fold average coverage for consensus basecalling were selected and mapped back against viral contigs > 10 kbp using minimap2 (https://github.com/lh3/minimap2). Genomic islands with at least 10 VirION reads spanning the full length of the island were selected for further study (137 genomic islands on 84 viral contigs). Median length of predicted proteins on these reads was 75 amino acids (74-76 aa, 95% CI) showing error correction had not been sufficient to correct for the enriched presence of stop codons resulting from indel errors. Therefore, an alternative approach to gene calling was developed. VirION reads spanning genomic islands were trimmed at the 5' and 3' end to leave only the read fragment mapping within the genomic island, and these fragments were used as a query in a BLASTx search against the NR database using diamond (Buchfink, Xie & Huson, 2015) , with the following settings: '-k 500 -more-sensitive -frameshift 15 -subject-cover 20 -evalue 1e-5' Each read fragment had the potential to encode several proteins, along its length, with each putative gene aligning to several similar proteins in the NR database. This information was used to estimate the start and stop loci of each putative gene using an unsupervised learning approach. Identification of the start and stop loci for each putative gene was estimated by hierarchically clustering start and stop loci for each match to an NR protein using Euclidian distance and single linkage. A threshold of 200 bp was used to discriminate between clusters following evaluation of a subset of 100 randomly selected reads at 50, 100, 200 and 500 bp). Below is an example of this clustering performed on read 01360d0f-19c8-4725-9c0f-e28b7b0db8a0_Basecall_2D_2d that mapped to contig P_tig00000038_pilon across its genomic island at 63060-64955 bp. In this example, clustering identified three separate gene products. Minimum start and maximum end loci of each cluster were used to extract putative gene encoding fragments from the VirION reads. Extracted gene fragments were then used as a query against NR using diamond BLASTX as before, but this time returning only the top hit and requiring at least 20 % of the query matched the subject. In total, 6,445 fragments were extracted from 3,072 reads, of which 4,599 returned a hit to a subject within the NR database. Of these, 3,888 (85%) were assigned to proteins whose title included the words 'hypothetical', 'DUF' (domain of unknown function), 'unknown' or 'uncharacterized'. The remaining 711 hits were then used to calculate the total number of unique annotated reads spanning each GI, the predicted functions within each GI and the number of different taxa identified within each GI. To evaluate whether different VirION reads spanning the same GI encoded different functions, two further criteria were evaluated. Firstly, all proteins hits associated with a genomic island were compared in an all-vs-all BLASTP (expect threshold: 10; word size: 6; matrix: BLOSUM62; gap costs: existence 11; Extension: 1). Different function was assumed if proteins had an alignment score <40. Secondly, start coordinates of different functional genes within a GI were compared to evaluate if one function was adjacent to another in the same genome, rather than being encoded in the same locus, but in different viral genomes. Different functional genes were conservatively assumed to be adjacent if they had an overlap of <100 bp to account for inaccurate identification of the start and end loci on errorprone reads. Results are available in Table S5 . Fast and sensitive protein alignment using DIAMOND Trimmomatic: a flexible trimmer for Illumina sequence data Ocean plankton. Patterns and ecological drivers of ocean viral communities Mauve: multiple alignment of conserved genomic sequence with rearrangements Geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data The sequence alignment/map format and SAMtools MetaQUAST: evaluation of metagenome assemblies Genomic variation landscape of the human gut microbiome