key: cord-0427548-9z691gqf authors: Cameron, Daniel L.; Papenfuss, Anthony T. title: VIRUSBreakend: Viral Integration Recognition Using Single Breakends date: 2020-12-11 journal: bioRxiv DOI: 10.1101/2020.12.09.418731 sha: 57ad1fc3518b877d6d22b80a2e2e8948a8dca0af doc_id: 427548 cord_uid: 9z691gqf Integration of viruses into infected host cell DNA can causes DNA damage and can disrupt genes. Recent cost reductions and growth of whole genome sequencing has produced a wealth of data in which viral presence and integration detection is possible. While key research and clinically relevant insights can be uncovered, existing software has not achieved widespread adoption, limited in part due to high computational costs, the inability to detect a wide range of viruses, as well as precision and sensitivity. Here, we describe VIRUSBreakend, a high-speed tool that identifies viral DNA presence and genomic integration recognition tool using single breakend variant calling. Single breakends are breakpoints in which only one side has been unambiguously placed. We show that by using a novel virus-centric single breakend variant calling and assembly approach, viral integrations can be identified with high sensitivity and a near-zero false discovery rate, even when integrated in regions of the host genome with low mappability, such as centromeres and telomeres that cannot be reliably called by existing tools. Applying VIRUSBreakend to a large metastatic cancer cohort, we demonstrate that it can reliably detect clinically relevant viral presence and integration including HPV, HBV, MCPyV, EBV, and HHV-8. Recent advances in sequencing technology have made routine large-scale whole genome sequencing (WGS) possible, including tumour sequencing 5 . These WGS data sets enable the detection of viral integrations through the identification of structural variant breakpoints between the host genome and the viral sequence. While there exist several tools capable of detecting viral integrations in WGS data, these tools have not yet gained widespread adoption. Existing tools fall short in one or more of three areas: the ability to detect more than one virus or virus family, runtime performance, and the inability to detect integrations into repetitive regions of the host genome (such as centromeres). At a high level, WGS viral integration detection software finds integration sites by identifying clusters of reads or read pairs spanning from host reference sequence to viral reference sequence. Viral integration tools such as BatVI 6 , VirTect 7 , and Virus-Clip 8 require a viral reference as input. While some of these tools are true single-virus tools, others can in theory be configured with multiple viral reference genomes. Including related viruses causes read alignment ambiguities when these viruses contain homologous regions. VirusFinder 9 , VirusFinder2/VERSE 10 , and VirusSeq 11 avoid this problem by first identifying viral presence before proceeding to integration detection using a single viral reference genome. These tools are still limited to a single viral reference genome, so a HHV-6 infection may mask the presence of a short genome such as HBV. The Pan-Cancer Analysis of Whole Genomes (PCAWG) project 3 avoided this problem by performing viral read classification prior to viral integration detection but this pipeline is not generally available as a standalone tool and its integration detection performance is determined by their use of VERSE as the sole integration detection tool. The need for an integrated, easy to use, virome-wide integration detection was identified by Chen at al 12 , a gap we propose to fill with VIRUSBreakend. For the vast majority of whole genome sequencing projects, viral integration detection is only one part of a larger analysis. Tools that are not computationally efficient will struggle to gain widespread adoption. Tools such as BatVI and Virus-Clip were developed in direct response to the computational cost of tools such as VirusFinder, VERSE, and VirusSeq. By far, the most computationally expensive step is the alignment of reads to the host and viral genomes and multiple approaches have been taken. VERSE, VirusSeq, and ViralFusionSeq 13 use a host then virus alignment approach, BATVI and Virus-Clip use a virus then host approach, while ViFi 14 and VirTect 7 take a combined host and virus approach. Each of these approaches have their advantages and drawbacks, but host then virus alignment approach has the unique advantage that it uses as input a bam file that will almost certainly have been generated in a typical WGS pipeline. Here, we show that this approach reduces the real-world computational cost of incorporating viral integration detection is less than one tenth of the computational cost of realignment. Finally, and most crucially, all existing viral integration detection tools rely on clusters of read alignments. Even tools such as VirusFinder/VERSE that perform an assembly step, still relying on read host alignment clusters to determine the insertion site. This fundamentally limits the viral integration detection capability in host regions with low mappability. Ignoring low mappability reads will result in false negatives, and reporting either a single arbitrary alignment or all possible alignments of a multi-mapping reads with both result in a high false positive rate and overestimation of the number of insertion sites. Our solution to this problem is to perform single breakend variant calling on the viral reference genome. Single breakends are breakpoints in which only one side is uniquely aligned to the reference genome 15 . By first identifying where in the viral genome an integration site occurs and assembling the host sequence adjacent to the integration, we obviate the problem of multi-mapping host read alignments. In cases where the host location cannot be unambiguously determined from the assembled contig, the contig sequence provides information about the repeat context of the integration site. Here we present a novel single breakend-based approach that can reliably detect viral integrations anywhere in the host genome. By identifying and assembling single breakend variants in the virus genome followed by taxonomic classification and alignment of the breakend contigs, VIRUSBreakend is able to reliably identify viral integrations in regions inaccessible to current integration detection approaches. VIRUSBreakend uses a multistage approach to identifying viral insertions ( Figure 1 ). Starting with a host-aligned SAM/BAM/CRAM file, VIRUSBreakend identifies viral reads of interest through Kraken2 16 taxonomic classification of all unaligned or partially aligned sequences using. If the read is at least partially classified as a virus, the full read pair is considered for further analysis. Viral read pairs are aligned to a viral reference consisting of the most abundant human-infecting virus in each genus. SNVs called with bcftools and the viral reference modified to incorporate these SNVs. Viral read pairs are realigned to the updated reference and structural variants called using GRIDSS2 17 and filtered to single breakends. Single breakends are breakpoints in which one side cannot be unambiguously aligned to the (viral) reference. In the case of viral integration breakpoints, this is because the viral reference does not include the host genome. As well as the location and orientation of the break junction, GRIDSS2 reports the assembled sequence for every single breakend. The assembled single breakend sequence is used to identify the host integration site by aligning to the host reference. Identified breakpoints fall into two categories: sites in which the host mapping is unambiguous, and ambiguous sites in which the integration site cannot be unambiguously determined (such as integrations into alpha satellite repeats). To facilitate downstream analysis, integration sites are annotated with the RepeatMasker 18 repeat type and class of the single breakend sequence. To evaluate theoretical performance, we created a synthetic benchmark with realistic insertion sites and compared VIRUSBreakend to VERSE, ViFi, and BATVI, as well as GRIDSS2. GRIDSS2 was included both as a breakpoint caller against a reference containing the human and viral genomes, as well as to evaluate the performance of host-centric single breakend calling, and was selected due to its performance 19, 20 . Insertion sites were simulated by inserting 2000bp of the non-reference HBV strain LC500247.1 at each 1Mb interval along chromosome 1 of the Telomere-to-Telomere consortium whole genome assembly of CHM13. Viral integration callers were run using their default settings with hg19 used as the host reference genome. To account for differences between CHM13 and hg19 coordinates, insertions were considered correct and counted as true positives if the viral coordinates were within 750bp and the insertion site coordinates were within 1Mb. Insertions were considered homologous calls if matching only on the viral side. Whether EBV is a risk factor for lung cancer is still subject to debate 23 . While we did find EBV viral presence enriched in lung cancer (69/666, p=0.000002), only 2 integration sites were found. In all samples, viral depth of coverage was less than 2.5% of the host indicating that EBV is not clonally integrated into the tumour in any sample. When viruses are integrated into low mappability sequences, existing read mapped based approaches must choose between erring on the side of caution and omitting these calls, or aiming for high sensitivity at the cost of a high false discovery rate. By taking a virus-centric single breakend approach, VIRUSBreakend solves this dilemma and enables both accurate and sensitive integration detection even in regions of low mappability. This does however come at a cost. Since the single breakends must be assembled, a traditional read mapping based caller will have greater sensitivity on low coverage samples as they do not suffer from the abrupt drop in sensitivity that VIRUSBreakead is subject to when there is insufficient coverage for reliable assembly. Similarly, above ~3000x viral genome coverage, the assembler used by VIRUSBreakend starts hitting assembly graph complexity limits thus making integration site detection somewhat unreliable for very highly expressed DNA viruses. Key to the extremely fast runtime performance of VIRUSBreakend is use of host-aligned input enabling the vast majority of reads to be immediately discarded. If a host-aligned input file was not available, the computational cost of VIRUSBreakend would increase by over an order of magnitude. For the vast majority of projects, this requirement is unproblematic as a host-aligned BAM/CRAM will be created for variant calling purposes. Since VIRUSBreakend has no constraints on the host reference used (other than they not contain the viral sequences), it is suitable for incorporation into an existing WGS pipeline. Sequences are considered to be of interest if either Kraken2 classifies the overall sequence with a viral taxid, any kmer is classified with a viral taxid and all kmers between that kmer are an ancestor of a viral taxid. That is, the sequence is entirely a virus of interest, or could be a split read containing a virus of interest and non-viral sequence (such as a read overlapping a host integration site). The originating reads names for sequences of interest are tracked and a second pass over the input file is performed to extract the entire originating fragment (i.e. both reads if paired-end sequencing) for all sequences of interest. A viral reference is created consisting of the most abundant human-infecting viral taxid for each genus. Here, we define 'most abundant' as a viral taxid of interest with the most reads directly assigned to that taxid by Kraken2, with ties broken by selecting the taxid with the most read assigned to that taxid or any of the descendent taxid. Only taxid with at least 50 sequences to it or a descendant and having an associated genome sequence in the kraken2 database are considered. If no taxid of interest reaches the 50 sequence threshold, processing terminates. In the case of multiple genomes associated with a taxid of interest, only the first contig is incorporated into the viral reference. Any accession included in the NCBI Viral Genomes 28 neighbours file ( https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=10239&cmd=download2 ) with host containing "human" is considered human-infecting. Extracted reads are aligned to the viral reference and SNVs called using bcftools call -c -v --ploidy 1 -V indels. The viral reference genome is then updated using bcftools consensus and extracted reads realigned to the new reference. To preserve viral coordinates, indels and SVs are not incorporated. Assembly/structural variant calling is performed using GRIDSS2. To improve library fragment size distribution estimation and GRIDSS quality score calculations, GRIDSS metrics are precomputed from the first 10,000,000 reads in the host-aligned input SAM/BAM/CRAM. Candidate host genome integration positions are identified by aligning the breakend inserted sequenced to host reference genome using bwa/gridss.AnnotateInsertedSequence which annotates variants with the nominal alignment, mapq, and also any alternative alignments reported in the bwa XA tag. The GRIDSS VCF is annotated with gridss_annotate_vcf_repeatmasker.sh and gridss_annotate_vcf_kraken2.sh using the same custom Kraken2 database used for viral sequence identification. The VCF is filtered to only single breakend variants in which the kraken2 classification of the single breakend sequence matches the host NCBI taxonomy ID (human, 9606). Finally, single breakend calls are transformed into host-virus breakpoint calls based on the first alignment position reported by bwa/gridss.AnnotateInsertedSequence with the alignment mapping quality score used to determine whether the called position is ambiguous or not. Tools VIRUSBreakend 2.10.2, BATVI 1.03, VERSE 2.0, ViFi (commit d56f4c2) and GRIDSS 2.10.2 were run with default settings using the installation procedures outlined in their user guide. All VERSE perl scripts were edit to use "#!/bin/usr/env perl" so as to be compatible with a conda installation, the base qual cutoff for the embedded CREST was reduced to 15 since wgsim defaults to 17 as a base quality score, and the undocumented perl dependencies were iteratively installed until execution no longer raised missing library errors. The missing step of creating bwa index in the BATVI batmis directory was run in addition to the installation instructions to circumvent the fatal error encountered building the indexes when following the documented instructions. BATVI full call set results (predictions.opt.subopt.txt) were not included as the recall of this call set was lower than the BATVI high confidence calls on the simulation data. The following modifications were required to get the most recent version (27 Mar 2019 d56f4c28) of ViFi to run: recommendation to use the supplied docker image was ignored since the docker command-line parsing logic was incorrect and crashed when using -b, and ignored to -v parameter thus always ran against HPV; crash bug in scripts/get_trans_new.py:238 was fixed by correcting the incorrect parenthesis on line 233; a conda environment was created using "conda create -n vifi bwa=0.7.17 python=2.7 pysam=0.15.2 samtools=1.9 hmmer". VIRUSBreakend was run using GRIDSS version 2.10.2, 4 threads and --rmargs "-e rmblast" since the conda installation of RepeatMasker does not perform RepeatMasker configuration. RepeatMasker 4.1.0 and Kraken 2.1.0 were installed from BioConda 29 . Reads were aligned to hg19 using bwa mem 0.7 and converted to bam and coordinate sorted using samtools 1.11. Reads associated with samples 145T, 177T, 180N, 186T, 198T, 26T, 200T, 268T, 43T, 46T, 70T, 71T, 95T in project ERP001196 were downloaded from SRA using fasterq-dump from sra-tools 2.10.8. ERR093473 and ERR173541 were excluded from analysis due non-transient fasterq-dump errors that could not be rectified. Since VIRUSBreakend supports multiple input files and GRIDSS performs per-file library fragment size distribution estimation, bam files were not merged. GRIDSS calls were annotated with gridss_annotate_kraken2.sh and filtered to single breakends with a viral taxid. VERSE and ViFi were not rerun and the results presented in their respective publications were taken as is. Simulated reads were generated using targeted insertion sequences between version of the 1.0 chr1 telomere-to-telomere consortium assembly of chm13 ( https://github.com/nanopore-wgs-consortium/CHM13 ) and LC500247.1. Each targeted inserted sequence was processed and called independently with insertions spaced evenly at every Mb of chrq chm13. 50kbp of chm13 from 1,000,000n -50,000, 1,000,000 was concatenated with 1kbp of LC500247.1 sequence from 4n to 4n + 2,000 which was then concatenated with 50kbp of chm13 from 1,000,000n + 10, 1,000,000 + 50,010. This resulted in an insertion of 2kbp of HBV, a 10bp gap between the left and right side of the integration, and 50kbp of human sequence flanking the insertion site. GRIDSS metrics from 20M simulating reads from chm13 using the same parameters were used to emulate realistic VIRUSBreakend WGS metrics calculations as no simulation data set contained the 10M reads used for metrics approximation. Read were simulated using ART 2.5.8 30 with parameters --noALN --paired --seqSys HSXn -ir 0 -ir2 0 -dr 0 -dr2 0 -k 0 -l 150 -m 500 -s 100 -rs 1. Separate data sets were generated with coverage (--fcov) of 5, 10, 15, 30, and 60. An E. Coli read pair and an E Coli/human read pair was appended to each fastq file to ensure ViFi did not crash. VIRUSBreakend was run with --minreads 15 to ensure the viral presence filtering did to interfere with this benchmark of integration detection capability. GRIDSS2 calls were filtered to single breakend variants that realigned to HBV. To account for coordinate differences between hg19 and CHM13 and LC500247.1 and the HBV references used by the callers, calls were considered a true positive if the hg19 host position was within 1Mb of the CHM13 truth position, and the viral position was within 750bp of the LC500247.1. Calls were considered homologous calls if the viral position matched and no full matches were found for that breakpoint. Duplicate or unmatched calls were considered false positives. GRIDSS2 breakpoint calls were generated by concatenating NC_003977.2 to hg19 aligning using bwa mem 0.7.17 and filtering to breakpoint calls involving NC_003977.2. VIRUSBreakend was run on 1988 samples in the Hartwig Medical Foundation cohort using pre-emptible c2-standard-4 (4 vCPUs, 16 GB memory) instances. --gridssargs "--jvmheap 13g" was specified since the GRIDSS2 JVM defaults to a 30GB heap size. TTV and xenotropic retroviruses were excluded from counts. VIRUSBreakend false negatives had insufficient coverage for the integration site to be assembled. c) Runtime on sample 177T from the hepatocellular carcinoma cohort when allocated 4 cores and 20GB memory. The cost of the initial host genome alignment is not included for VIRUSBreakend or VERSE as this is typically performed in a WGS pipeline regardless of whether viral integration detection is performed or not. Hartwig Medical Foundation cohort data was obtained from the Hartwig Medical Foundation (Data request DR-005) Viruses associated with human cancer Landscape of DNA virus associations across human malignant cancers: analysis of 3,775 cases using RNA-Seq The landscape of viral associations in human cancers Distinct viral reservoirs in individuals with spontaneous control of HIV-1 Pan-cancer whole-genome analyses of metastatic solid tumours BATVI: Fast, sensitive and accurate detection of virus integrations Detecting virus integration sites based on multiple related sequencing data by VirTect Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data VERSE: a novel approach to detect virus integration in host genomes through reference genome customization VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue Comprehensive comparative analysis of methods and software for identifying viral integrations ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution ViFi: accurate detection of viral integration and mRNA fusion reveals indiscriminate and unregulated transcription in proximal genomic regions in cervical cancer The variant call format and VCFtools Improved metagenomic analysis with Kraken 2 GRIDSS2: harnessing the power of phasing and single breakends in somatic structural variant detection Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma HPV-mediated inactivation of tumor suppressor p53 Detection of Epstein-Barr Virus Infection in Non-Small Cell Lung Cancer Herpesvirus telomeric repeats facilitate genomic integration into host telomeres and mobilization of viral DNA during reactivation Unbiased optical mapping of telomere-integrated endogenous human herpesvirus 6 Fast and accurate long-read alignment with Burrows-Wheeler transform GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly NCBI viral genomes resource Bioconda: sustainable and comprehensive software distribution for the life sciences ART: a next-generation sequencing read simulator This publication and the underlying study have been made possible partly on the basis of the data that Hartwig Medical Foundation and the Center of Personalised Cancer Treatment (CPCT) have made available to the study. The authors declare no competing interests.