A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing

Jacob Househam, Barts Cancer Institute, Queen Mary University of London, UK
William CH Cross (★), UCL Cancer Institute, University College London, UK
Giulio Caravagna (★), Department of Mathematics and Geosciences, University of Trieste, Italy

(★) Joint last authors. Corresponding author: (GC) gcaravagna@units.it.

Abstract. Cancer is a global health issue that places enormous demands on healthcare systems. Basic research, the development of targeted treatments, and the utility of DNA sequencing in clinical settings have been significantly improved with the introduction of whole genome sequencing. However, the broad applications of this technology come with complications. To date there has been very little standardisation in how data quality is assessed, leading to inconsistencies in analyses and disparate conclusions. Manual checking and complex consensus calling strategies often do not scale to large sample numbers, which leads to procedural bottlenecks. To address this issue, we present a quality control method that integrates point mutations, copy numbers, and other metrics into a single quantitative score. We demonstrate its power on 1,065 whole genomes from a large-scale pan-cancer cohort, and on multi-region data of two colorectal cancer patients. We highlight how our approach significantly improves the generation of cancer mutation data, providing visualisations for cross-referencing with other analyses. Our approach is fully automated, designed to work downstream of any bioinformatic pipeline, and can automatise tool parameterization, paving the way for fast computational assessment of data quality in the era of whole genome sequencing.
bioRxiv preprint, this version posted February 13, 2021; doi: https://doi.org/10.1101/2021.02.13.429885. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

Cancer remains an unsolved problem, and a key factor is that tumours develop as heterogeneous cellular populations (Greaves and Maley 2012; McGranahan and Swanton 2017, 2015). Cancer genomes can harbour multiple types of mutations compared to healthy cells (Macintyre et al. 2018; Martincorena et al. 2018, 2015; Nik-Zainal et al. 2012), and many of these events contribute to the pathogenesis of the disease and to therapeutic resistance. A popular design of studies intending to understand tumour development involves collecting tumour and matched-normal biopsies, and generating so-called “bulk” DNA sequencing data for both (Barnell et al. 2019). By using bioinformatic tools to cross-reference the normal genome against the aberrant one, the mutations found in the tumour sample, and their heterogeneity, can be derived and used in other analyses. These analyses include, but are not limited to, driver mutation identification (Bailey et al. 2018; Gonzalez-Perez et al.
2013), which aims to discern the key aberrations that cause a tumour to grow; patient clustering, which aims to identify treatment groups with similar biological characteristics; and evolutionary inference (Gerstung et al. 2020; Nik-Zainal et al. 2012; Caravagna et al. 2020), which informs us how a particular tumour developed from normal cells.

There are several types of mutations that we can retrieve from DNA sequencing data (Campbell et al. 2020). Broadly, these can be categorized as single nucleotide variants (SNVs), copy number alterations (CNAs) and other more complex changes such as structural variants (Li et al. 2020). All types of mutations can drive tumour progression, and are therefore important entities to study (Kent and Green 2017; Levine, Jenkins, and Copeland 2019). Luckily, the steady drop in sequencing costs is fueling the creation of large amounts of data, which are becoming increasingly available for researchers to access through public databases. Notably, we are entering the era of high-resolution whole-genome sequencing (WGS), a technology that can read out the majority of a tumour genome, providing major improvements over whole-exome counterparts.

Generating some of these data, however, poses challenges. While SNVs are the simplest type of mutations to detect using bioinformatic analysis and perhaps have the most well-established supporting tools (Li et al. 2020), CNAs are particularly difficult to call since the baseline ploidy of the tumour (i.e., the number of chromosome copies) is usually unknown and has to be inferred from the data. CNAs are important types of cancer mutations; large-scale gains and losses of chromosome arms or sections of arms can confer large-scale phenotypic changes on tumour cells, and are often important clinical targets (Gerstung et al. 2020; Watkins et al. 2020).

SNVs and CNAs are intertwined mutation groups.
They can overlap within a tumour cell’s genome, meaning the number of copies of an SNV can be amplified or indeed reduced by CNAs. This depends on the ploidy of the genome regions overlapping with the variants. For instance, for a clonal (meaning present in every cell of the tumour sample) heterozygous SNV in a diploid tumour genome, the expected variant allele frequency (VAF) is 50% (i.e., half of the reads from tumour cells will harbour the SNV). Alternatively, if each chromosome is present in three copies (triploid), the expected VAF is 33% (1/3) if the SNV occurred after the amplification, or 67% (2/3) if the SNV is on the amplified chromosome and occurred before the amplification. The theoretical frequencies are observed with a Binomial noise model that depends on the depth of sequencing and the actual VAF (Nik-Zainal et al. 2012; Caravagna et al. 2020). We note that these VAFs hold for pure bulk tumour samples (100% tumour cells).
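As a worked illustration of these numbers, the following minimal Python sketch computes the expected VAF of a clonal mutation in a pure sample as the ratio between the copies carrying the mutation and the total copies of the locus, together with the standard deviation implied by Binomial read sampling. The function names and the 60x depth are assumptions made for this example only; they are not part of CNAqc.

```python
from fractions import Fraction
from math import sqrt


def expected_vaf_pure(multiplicity: int, total_copies: int) -> Fraction:
    """Expected VAF of a clonal mutation in a 100% pure tumour sample.

    multiplicity -- number of chromosome copies that carry the mutation
    total_copies -- total number of copies of the locus in each tumour cell
    """
    if not 1 <= multiplicity <= total_copies:
        raise ValueError("multiplicity must lie between 1 and total_copies")
    return Fraction(multiplicity, total_copies)


def binomial_vaf_sd(vaf: float, depth: int) -> float:
    """Standard deviation of the observed VAF if variant reads ~ Binomial(depth, vaf)."""
    return sqrt(vaf * (1.0 - vaf) / depth)


# The three cases discussed in the text.
cases = {
    "diploid, heterozygous": expected_vaf_pure(1, 2),
    "triploid, SNV after amplification": expected_vaf_pure(1, 3),
    "triploid, SNV before amplification": expected_vaf_pure(2, 3),
}

DEPTH = 60  # hypothetical sequencing depth, chosen only for illustration
for label, vaf in cases.items():
    sd = binomial_vaf_sd(float(vaf), DEPTH)
    print(f"{label}: expected VAF = {float(vaf):.3f}, sd at {DEPTH}x = {sd:.3f}")
```

The spread shrinks with the square root of the depth, which is why deeper sequencing makes the VAF peaks easier to tell apart.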
Realistically, most bulk samples contain normal cells, the percentage of which shifts these theoretical frequencies towards lower values. These ideas are leveraged by methods that seek to compute the Cancer Cell Fractions (CCFs) of the tumour, i.e., a normalisation of the observed tumour VAF for the CNA, the number of copies of a mutation (mutation multiplicity) and tumour purity (Nik-Zainal et al. 2012).

Many bioinformatics pipelines are designed to start from a BAM-formatted input file and, following variant calling, extract the VAF of mutations while calling CNAs in parallel (Boeva et al. 2011; Cmero et al. 2020; Zaccaria and Raphael 2020; Van Loo et al. 2010). These analyses are nearly always decoupled, and can return inconsistent variant calls, i.e., CNAs and purity that mismatch the empirical VAF from the BAMs. Since CNAs and purity are inferred through various measurements that are subject to noise (mutation allele ratios, tumour-normal depth ratios and B-allele frequencies are prime examples), they are the most likely cause of error. While in some cases these errors can be spotted and fixed by manual intervention, this process is also subject to inconsistencies in the absence of a proper statistical framework, and does not scale in studies seeking to generate datasets with millions of data points (Campbell et al. 2020; Priestley et al. 2019; Turnbull et al. 2018). The intrinsic performance of a variant caller and sequencing noise therefore massively impact CNA calling and purity inference, propagating errors into downstream analyses that eventually lead to incorrect biological conclusions, and becoming a crucial computational bottleneck in the era of high-resolution whole-genome sequencing.

To solve these problems we developed CNAqc (Data Availability), a computational framework with a de novo statistical model to assess the conformance of expected SNVs, CNAs, and purity estimates.
We strove to make the tool as simple to implement as possible, maximising compatibility across differing pipelines. CNAqc computes a quantitative quality check (QC) score for the overall agreement of the calls, which can be used to tune the parameters of callers (e.g., decrease purity or increase ploidy), or select among multiple CNA profiles (e.g., tetraploid versus diploid tumours) until a fit is achieved. In CNAqc we also integrate these measures to determine CCF values (Dentro, Wedge, and Van Loo 2017). CNAqc is implemented as a highly optimised R package that can be used downstream of any cancer mutation calling pipeline. It can be run on WGS data, and can automatically compute a QC score in a matter of seconds, which is an extremely useful feature for large-scale genomics consortia that analyse many samples per day. To demonstrate the tool we analysed 11 bulk WGS datasets from two multi-region colorectal cancers, and analysed 1,065 high-quality whole genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort (Campbell et al. 2020).
Results

The CNAqc framework

CNAqc can perform different types of operations on CNAs and somatic mutation calls obtained from bulk WGS. In what follows, we will refer explicitly to SNVs as the main type of mutation used, but in principle other types of mutations, such as insertions or deletions, also apply. The package supports the most common CNA copy states found in cancers: heterozygous normal states (1:1 chromosome complement), loss of heterozygosity (LOH) in monosomy (1:0) and copy-neutral (2:0) form, trisomy (2:1) or tetrasomy (2:2) gains. The tool also works with exome data, but the reduced mutational burden can, in general, lower the reliability of the QC score (Supplementary Figure S1).

Many metrics output by CNAqc are derived from the link between copy-state profiles (i.e., the copies of the major and minor alleles, which sum up to the ploidy of a segment) and allele frequencies that are explicit from read counts. Combinatorial equations and frequency spectrum analysis can quantitatively determine if CNAs and purity are consistent with the VAF distribution (Online methods). This score also suggests “corrections” to automatically fine-tune and repeat CNA calling runs. This works for tools that use either Bayesian priors or point estimates of the parameters.

The key equations for a somatic mutation link its VAF $v$ and CCF $c$ to sample purity $\pi$, tumour ploidy $p$, and $m \in \{1, 2\}$, the number of copies of a mutation (Figure 1a). Effectively, for complex 2:0, 2:1 and 2:2 copy states, $m$ phases mutations that were acquired before or after the copy number event (Figure 1b). We remark that we observe $v$, and infer $\pi$, $p$ and $m$, finally deriving $c$, which is difficult to estimate (Figure 1c). In CNAqc we use the following formulas for VAF (Figure 1d)

$$ v = \dfrac{m \pi c}{2(1-\pi) + \pi p} \, , $$

which for a clonal mutation ($c = 1$) present in a single copy ($m = 1$) reduces to $v = \dfrac{\pi}{2(1-\pi) + \pi p}$, and for CCF

$$ c = \dfrac{v[(p-2)\pi + 2]}{m\pi} \, . $$

These formulas lead to other interesting quantities (Online methods). For instance, if we know tumour purity and the ploidy of a CNA segment, then the VAF of mutations mapped to the segment must peak at a known location $x$. The value for $x$ follows from combinatorial arguments relating all other variables (Nik-Zainal et al. 2012). From a QC perspective, the Euclidean distance between the theoretical expectation $x$ and the peaks observed from data is an error score that approaches 0 for perfect calls, and grows otherwise. CNAqc can visualise the input segments (Figure 2a) and read counts (Figure 2b-d). Other analyses, such as CCF computation and genome fragmentation analysis, are also available, and have other visualisations (Figure 2e).

The scores of CNAqc can be used to determine a QC PASS or FAIL status for every copy state within a tumour genome, weighting different evidence from the data.
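The link between VAF, CCF, purity, ploidy and multiplicity described in this section, and the distance-based error score, can be illustrated with a minimal Python sketch. It uses the relation v = m·π·c / (2(1−π) + π·p) and its inverse for c. The function names, the example purity of 0.8, the trisomic (2:1) segment and the "observed" peaks are all assumptions made for this example; how CNAqc matches and weights peaks internally is not reproduced here.

```python
from math import sqrt
from typing import Sequence


def expected_vaf(purity: float, ploidy: float, multiplicity: int, ccf: float = 1.0) -> float:
    """VAF implied by purity (pi), segment ploidy (p), multiplicity (m) and CCF (c)."""
    return multiplicity * purity * ccf / (2.0 * (1.0 - purity) + purity * ploidy)


def ccf_from_vaf(vaf: float, purity: float, ploidy: float, multiplicity: int) -> float:
    """CCF obtained by inverting the VAF relation: c = v[(p - 2)pi + 2] / (m pi)."""
    return vaf * ((ploidy - 2.0) * purity + 2.0) / (multiplicity * purity)


def peak_error(expected: Sequence[float], observed: Sequence[float]) -> float:
    """Euclidean distance between matched expected and observed VAF peaks."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed peaks must be matched one to one")
    return sqrt(sum((e - o) ** 2 for e, o in zip(expected, observed)))


# Hypothetical trisomic (2:1) segment: ploidy 3, clonal mutations with m = 1 or m = 2.
PURITY, PLOIDY = 0.8, 3
expected_peaks = [expected_vaf(PURITY, PLOIDY, m) for m in (1, 2)]

# Hypothetical peaks read off a VAF histogram (made-up values for illustration).
observed_peaks = [0.29, 0.56]

print("expected peaks:", [round(x, 3) for x in expected_peaks])
print("error score:", round(peak_error(expected_peaks, observed_peaks), 4))
```

An error close to 0 indicates that purity and copy state agree with the data; a caller solution with the wrong purity or ploidy moves the expected peaks away from the observed ones and inflates the score.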
One score is for the quality of CNA segmentation and tumour purity, and one for CCF values. The former is based on a density-based analysis of the VAF distribution, and uses both a non-parametric kernel density and a univariate Binomial mixture to match peaks in the VAF data (Figure 3a-d). The latter is based on the entropy of the latent variables in a Binomial mixture model, whose components are peaked at the expected VAF. From this density we identify VAF ranges for which it is hard to estimate the mutation multiplicity, and therefore the CCF of the mutation (Figure 3e-h). To the best of our understanding, this is the only framework providing quantitative metrics for all the most widespread types of tumour mutations.

Multi-region colorectal cancer data

We have run CNAqc on previously published WGS multi-region data (Cross et al. 2018; Caravagna et al. 2020), which was collected from multiple regions of primary colorectal adenocarcinomas across two distinct patients. For all these samples we have high quality somatic mutation calls (Cross et al. 2018) that were obtained using CloneHD (Fischer et al. 2014). We have re-called CNAs with the Sequenza CNA caller (Favero et al. 2015), and sought to check the inferred copy states and tumour purity with CNAqc, along with SNVs generated by Mutect2 (Benjamin et al. 2019).
Sequenza was run using distinct parameterizations. We began with the default range proposals for purity and ploidy (see footnote 1), which we then improved in a final run following CNAqc analysis. We also forced a Sequenza fit with a constrained tetraploid genome (ploidy equal to 4), and one with low purity. All these steps could easily have been automatised in a procedure that runs the caller, obtains score metrics for the solution from CNAqc, and re-runs the fits with adjusted parameters if required.

The results for one sample of patient Set7 (Cancer 7 in the original manuscript; Cross et al. 2018) are in Figure 4; the other samples for patient Set7 are in Supplementary Figures S2-S4. All samples for patient Set6 are in Supplementary Figures S5-S10. The peak detection scores produced by CNAqc invariably fail both the tetraploid and low-purity solutions, passing the others; the small adjustment suggested to the default parameters slightly improves the purity estimate, but the overall quality is high even with just default parameters (Figure 4b). The whole-genome CNA profile for this sample shows some degree of aneuploidy (Figure 4c), and it is easy with CNAqc to assess miscalled CNA segments ahead of the VAF data (Figure 4d).
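The automated procedure mentioned above (run the caller, score the solution, re-run with adjusted parameters) could be organised along the lines of the following Python sketch. Everything here is hypothetical: `Params`, `tune_until_pass`, the tolerance and the toy caller and scorer are placeholders invented for illustration, and they do not correspond to the Sequenza or CNAqc interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Params:
    """Hypothetical container for the parameters proposed to, or fitted by, a CNA caller."""
    purity: float
    ploidy: float


def tune_until_pass(
    run_caller: Callable[[Params], Params],
    score_fit: Callable[[Params], Tuple[float, Params]],
    start: Params,
    tolerance: float = 0.05,
    max_rounds: int = 5,
) -> Optional[Params]:
    """Re-run the caller with the suggested parameters until the QC error is small.

    run_caller -- takes proposed parameters, returns the fitted solution
    score_fit  -- returns (error score, suggested parameters for the next run)
    Returns the first solution whose absolute error is within tolerance, or None.
    """
    params = start
    for _ in range(max_rounds):
        fit = run_caller(params)
        error, suggestion = score_fit(fit)
        if abs(error) <= tolerance:
            return fit
        params = suggestion
    return None


# Toy stand-ins, for demonstration only: the "caller" returns its proposal unchanged,
# and the "QC" compares purity against a made-up ground truth and proposes a half-step.
TRUE_PURITY = 0.8


def toy_caller(params: Params) -> Params:
    return params


def toy_score(fit: Params) -> Tuple[float, Params]:
    error = TRUE_PURITY - fit.purity
    return error, Params(purity=fit.purity + error / 2.0, ploidy=fit.ploidy)


result = tune_until_pass(toy_caller, toy_score, Params(purity=0.6, ploidy=2.0))
print(result)
```

Passing the caller and the scorer as functions keeps the loop independent of any specific tool, in line with the aim of working downstream of any pipeline; returning `None` after `max_rounds` flags samples that still need manual inspection.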
The analysis of all the samples available for Set7 shows an overall CNA profile with many diploid regions and mild aneuploidy (Figure 4e), consistent with a microsatellite stable colorectal cancer (Cross et al. 2018).

Large-scale pan cancer PCAWG calls

We have run CNAqc on a subset of the full PCAWG cohort, which contains thousands of samples from multiple tumour types (Campbell et al. 2020). The median coverage of this cohort is 45x, with purity ~65% (Caravagna et al. 2020); a much lower resolution than the data available for the multi-region samples discussed in the previous section. Because of this, peak detection from the VAF distribution across some of the samples would be challenged by signal quality; in practice, for genomes with complex aneuploidy and massive drops in purity and coverage the VAF distribution is unsuitable for peak detection, leading to false positives in the QC process. To avoid this and work with suitable samples, we identified n = 1065 cases adopting the following conditions: (i) the tumour type contains >20 samples, (ii) the tumour genome used for QC contains >30% of the overall SNVs in the tumour (so a substantial part of the overall mutational burden), and (iii) the purity of the sample is >60% (so the signal is suitable for peak detection). On a standard cluster CNAqc ran in less than 1 hour for these samples; notably, the completion time (per sample) on a laptop is less than 1 minute, meaning that preliminary analysis can be carried out very quickly and without large computing infrastructures.

Footnote 1: Technically, the default Sequenza values for ploidy reach a maximum of 7; this being unrealistic for our cases, we limited the maximum ploidy to 5.

The calls in PCAWG were obtained by consensus with multiple bioinformatics tools, and for this reason we expected them to be reliable. Manual inspection of some patient data indeed showed many high-quality calls, but also highlighted a variety of interesting cases. For instance, tumours with extremely low mutational burden but high quality calls still yielded a useful report, suggesting that CNAqc can also work with the mutational burden typical of whole-exome sequencing (Supplementary Figure S1). For other tumours, we found high purity levels >90%, which are probably overestimated (Supplementary Figure S11) compared to others where purity is genuinely very high (Supplementary Figure S12). Overall, the scores from peak detection are reliable for the majority of the analysed samples (Figure 5a), with only a few cases requiring further checks (Figure 5b); the diploid 85% purity tumour in Figures 2 and 3 is taken from this list. The peak detection by CNAqc therefore confirms the reliability of the calls in terms of breakpoints, segment ploidy and tumour purity.

CCF computations showed a higher rate of failures with CNAqc analysis (Figure 5a). This is inevitably due to the lack of signal separability stemming from low coverage of these samples, even for high-quality genomes.
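To illustrate why low coverage limits multiplicity assignment, the Python sketch below computes the entropy of the posterior over multiplicities in a two-component Binomial mixture whose components are centred on the expected VAF peaks, and averages it over read counts at two depths. This is a simplified illustration under stated assumptions (a single 2:1 segment, clonal mutations only, equal mixture weights, purity 0.65, depths of 45 and 150 reads); it is not the CNAqc implementation, and all function names are hypothetical.

```python
from math import comb, log2
from typing import Sequence


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k variant reads out of n when the true VAF is p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def assignment_entropy(k: int, n: int, peaks: Sequence[float]) -> float:
    """Entropy (bits) of the posterior over mixture components, with equal weights."""
    likelihoods = [binom_pmf(k, n, v) for v in peaks]
    total = sum(likelihoods)
    posterior = [lik / total for lik in likelihoods]
    return -sum(q * log2(q) for q in posterior if q > 0.0)


def mean_entropy(n: int, peaks: Sequence[float]) -> float:
    """Average assignment entropy when read counts come from the equal-weight mixture."""
    weight = 1.0 / len(peaks)
    return sum(
        weight * sum(binom_pmf(k, n, v) for v in peaks) * assignment_entropy(k, n, peaks)
        for k in range(n + 1)
    )


# Expected peaks for multiplicity 1 and 2 in a 2:1 segment (ploidy 3) at 65% purity,
# using v = m * pi / (2 * (1 - pi) + pi * p) for clonal mutations.
PURITY, PLOIDY = 0.65, 3
peaks = [m * PURITY / (2.0 * (1.0 - PURITY) + PURITY * PLOIDY) for m in (1, 2)]

for depth in (45, 150):  # 45x is the cohort median; 150x is a hypothetical comparison
    print(f"depth {depth}: mean entropy = {mean_entropy(depth, peaks):.3f} bits")
```

An entropy near 1 bit means the two multiplicities are almost equally plausible for a mutation, so its CCF is uncertain; the average entropy falls as depth increases because the two Binomial components overlap less.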
Therefore, while peaks could be determined for these data, mutation multiplicity assessment would have required higher coverage than was available. In summary, these analyses revealed that the problem of validating CNA calls, compared to determining CCF estimates, can be approached with lower coverage and purity values using CNAqc.

Discussion

WGS is a powerful approach to detect extensive mutations that drive human cancers. Many large-scale initiatives such as PCAWG (Campbell et al. 2020), the Hartwig Medical Foundation (Priestley et al. 2019) and Genomics England (Turnbull et al. 2018) have already generated WGS data for thousands of cancer patients, with many cancer institutes converging towards these efforts. Calling mutations from WGS data requires complex bioinformatics pipelines (Barnell et al. 2019; Cmero et al. 2020; Li et al. 2020) and any downstream analysis relies upon these calls, putting the quality of the generated data under the spotlight.

CNAqc offers the first principled framework to control the quality of tumour mutation calls.
The tool can analyse SNVs and more general types of nucleotide substitutions; SNVs are more reliable and depend less on alignment quality than other mutations, and therefore should be checked first. CNAqc uses a peak-detection analysis to validate CNA segments and purity, exploiting a combinatorial model for cancer alleles. Within the same framework, CNAqc also computes CCF values, highlighting mutations for which such values are uncertain. CNAqc features can be used to clean up data, automatising parameter choice for virtually any caller, prioritizing good calls and selecting information for downstream analyses.

The CNAqc framework leverages the relationship between tumour VAF and ploidy. The quality of the control process itself depends on the ability to process the VAF spectrum and detect peaks. Therefore, if the VAF quality is very low because, e.g., the sample has low purity or coverage, the overall quality of the check decreases, making it more difficult to completely automate quality checking. However, for the large majority of samples, CNAqc provides a very effective and fast way to integrate quality metrics in standard pipelines.

Generating high quality calls is just a prelude to more complex analyses that interpret cancer genotypes and their history, with and without therapy (Ding et al. 2012; Landau et al. 2013; Caravagna et al. 2016; Jamal-Hanjani et al. 2017; Turajlic et al. 2018; Caravagna et al. 2018). CNAqc can pass a sample at an early stage, leaving the possibility of assessing, at a later stage, whether the quality of the data is high enough to approach specific research questions. With the ongoing implementation of large-scale sequencing efforts, CNAqc provides a good solution for modular pipelines that self-tune parameters based on quality scores.
To our knowledge, this is the first stand-alone tool which leverages the power of combining the most common types of cancer mutations (SNVs and CNAs) to automatically control the quality of WGS assays. We believe CNAqc can help reduce the burden of manual quality checking and parameter tuning.

References

Bailey, Matthew H., Collin Tokheim, Eduard Porta-Pardo, Sohini Sengupta, Denis Bertrand, Amila Weerasinghe, Antonio Colaprico, et al. 2018. “Comprehensive Characterization of Cancer Driver Genes and Mutations.” Cell 173 (2): 371–85.e18. https://doi.org/10.1016/j.cell.2018.02.060.
Barnell, Erica K., Peter Ronning, Katie M. Campbell, Kilannin Krysiak, Benjamin J. Ainscough, Lana M. Sheta, Shahil P. Pema, et al. 2019. “Standard Operating Procedure for Somatic Variant Refinement of Sequencing Data with Paired Tumor and Normal Samples.” Genetics in Medicine: Official Journal of the American College of Medical Genetics 21 (4): 972–81. https://doi.org/10.1038/s41436-018-0278-z.
Benjamin, David, Takuto Sato, Kristian Cibulskis, Gad Getz, Chip Stewart, and Lee Lichtenstein. 2019. “Calling Somatic SNVs and Indels with Mutect2.” bioRxiv, December, 861054. https://doi.org/10.1101/861054.
Boeva, Valentina, Andrei Zinovyev, Kevin Bleakley, Jean-Philippe Vert, Isabelle Janoueix-Lerosey, Olivier Delattre, and Emmanuel Barillot. 2011. “Control-Free Calling of Copy Number Alterations in Deep-Sequencing Data Using GC-Content Normalization.” Bioinformatics 27 (2): 268–69. https://doi.org/10.1093/bioinformatics/btq635.
Campbell, Peter J., Gad Getz, Jan O. Korbel, Joshua M. Stuart, Jennifer L. Jennings, Lincoln D. Stein, Marc D. Perry, et al. 2020. “Pan-Cancer Analysis of Whole Genomes.” Nature 578 (7793): 82–93. https://doi.org/10.1038/s41586-020-1969-6.
Caravagna, Giulio, Ylenia Giarratano, Daniele Ramazzotti, Ian Tomlinson, Trevor A. Graham, Guido Sanguinetti, and Andrea Sottoriva. 2018. “Detecting Repeated Cancer Evolution from Multi-Region Tumor Sequencing Data.” Nature Methods 15 (9): 707–14. https://doi.org/10.1038/s41592-018-0108-x.
Caravagna, Giulio, Alex Graudenzi, Daniele Ramazzotti, Rebeca Sanz-Pamplona, Luca De Sano, Giancarlo Mauri, Victor Moreno, Marco Antoniotti, and Bud Mishra. 2016. “Algorithmic Methods to Infer the Evolutionary Trajectories in Cancer Progression.” Proceedings of the National Academy of Sciences of the United States of America 113 (28): E4025–34. https://doi.org/10.1073/pnas.1520213113.
Caravagna, Giulio, Timon Heide, Marc J. Williams, Luis Zapata, Daniel Nichol, Ketevan Chkhaidze, William Cross, et al. 2020. “Subclonal Reconstruction of Tumors by Using Machine Learning and Population Genetics.” Nature Genetics 52 (9): 898–907.
https://doi.org/10.1038/s41588-020-0675-5.
Cmero, Marek, Ke Yuan, Cheng Soon Ong, Jan Schröder, Niall M. Corcoran, Tony Papenfuss, Christopher M. Hovens, Florian Markowetz, and Geoff Macintyre. 2020. “Inferring Structural Variant Cancer Cell Fraction.” Nature Communications 11 (1): 730. https://doi.org/10.1038/s41467-020-14351-8.
Cortés-Ciriano, Isidro, Jake June-Koo Lee, Ruibin Xi, Dhawal Jain, Youngsook L. Jung, Lixing Yang, Dmitry Gordenin, et al. 2020. “Comprehensive Analysis of Chromothripsis in 2,658 Human Cancers Using Whole-Genome Sequencing.” Nature Genetics 52 (3): 331–41. https://doi.org/10.1038/s41588-019-0576-7.
Cross, William, Michal Kovac, Ville Mustonen, Daniel Temko, Hayley Davis, Ann-Marie Baker, Sujata Biswas, et al. 2018. “The Evolutionary Landscape of Colorectal Tumorigenesis.” Nature Ecology & Evolution 2 (10): 1661–72. https://doi.org/10.1038/s41559-018-0642-z.
Dentro, Stefan C., David C. Wedge, and Peter Van Loo. 2017. “Principles of Reconstructing the Subclonal Architecture of Cancers.” Cold Spring Harbor Perspectives in Medicine 7 (8). https://doi.org/10.1101/cshperspect.a026625.
Ding, Li, Timothy J. Ley, David E. Larson, Christopher A. Miller, Daniel C. Koboldt, John S. Welch, Julie K. Ritchey, et al. 2012. “Clonal Evolution in Relapsed Acute Myeloid Leukaemia Revealed by Whole-Genome Sequencing.” Nature 481 (7382): 506–10. https://doi.org/10.1038/nature10738.
Favero, F., T. Joshi, A. M. Marquard, N. J. Birkbak, M. Krzystanek, Q. Li, Z. Szallasi, and A. C. Eklund. 2015. “Sequenza: Allele-Specific Copy Number and Mutation Profiles from Tumor Sequencing Data.” Annals of Oncology: Official Journal of the European Society for Medical Oncology / ESMO 26 (1): 64–70. https://doi.org/10.1093/annonc/mdu479.
Fischer, Andrej, Ignacio Vázquez-García, Christopher J. R. Illingworth, and Ville Mustonen.
2014. “High-Definition Reconstruction of Clonal Composition in Cancer.” Cell Reports 7 (5): 1740–52. https://doi.org/10.1016/j.celrep.2014.04.055.
Gerstung, Moritz, Clemency Jolly, Ignaty Leshchiner, Stefan C. Dentro, Santiago Gonzalez, Daniel Rosebrock, Thomas J. Mitchell, et al. 2020. “The Evolutionary History of 2,658 Cancers.” Nature 578 (7793): 122–28. https://doi.org/10.1038/s41586-019-1907-7.
Gonzalez-Perez, Abel, Christian Perez-Llamas, Jordi Deu-Pons, David Tamborero, Michael P. Schroeder, Alba Jene-Sanz, Alberto Santos, and Nuria Lopez-Bigas. 2013. “IntOGen-Mutations Identifies Cancer Drivers across Tumor Types.” Nature Methods 10 (11): 1081–82. https://doi.org/10.1038/nmeth.2642.
Greaves, Mel, and Carlo C. Maley. 2012. “Clonal Evolution in Cancer.” Nature 481 (7381): 306–13. https://doi.org/10.1038/nature10762.
Jamal-Hanjani, Mariam, Gareth A. Wilson, Nicholas McGranahan, Nicolai J. Birkbak, Thomas B. K. Watkins, Selvaraju Veeriah, Seema Shafi, et al. 2017. “Tracking the Evolution of Non-Small-Cell Lung Cancer.” The New England Journal of Medicine 376 (22): 2109–21. https://doi.org/10.1056/NEJMoa1616288.
Kent, David G., and Anthony R. Green. 2017. “Order Matters: The Order of Somatic Mutations Influences Cancer Evolution.” Cold Spring Harbor Perspectives in Medicine 7 (4). https://doi.org/10.1101/cshperspect.a027060.
Landau, Dan A., Scott L. Carter, Petar Stojanov, Aaron McKenna, Kristen Stevenson, Michael S. Lawrence, Carrie Sougnez, et al. 2013. “Evolution and Impact of Subclonal Mutations in Chronic Lymphocytic Leukemia.” Cell 152 (4): 714–26. https://doi.org/10.1016/j.cell.2013.01.019.
Levine, Arnold J., Nancy A. Jenkins, and Neal G. Copeland. 2019. “The Roles of Initiating Truncal Mutations in Human Cancers: The Order of Mutations and Tumor Cell Type Matters.” Cancer Cell 35 (1): 10–15.
https://doi.org/10.1016/j.ccell.2018.11.009.
Li, Yilong, Nicola D. Roberts, Jeremiah A. Wala, Ofer Shapira, Steven E. Schumacher, Kiran Kumar, Ekta Khurana, et al. 2020. “Patterns of Somatic Structural Variation in Human Cancer Genomes.” Nature 578 (7793): 112–21. https://doi.org/10.1038/s41586-019-1913-9.
Macintyre, Geoff, Teodora E. Goranova, Dilrini De Silva, Darren Ennis, Anna M. Piskorz, Matthew Eldridge, Daoud Sie, et al. 2018. “Copy Number Signatures and Mutational Processes in Ovarian Carcinoma.” Nature Genetics 50 (9): 1262–70. https://doi.org/10.1038/s41588-018-0179-8.
Martincorena, Iñigo, Joanna C. Fowler, Agnieszka Wabik, Andrew R. J. Lawson, Federico Abascal, Michael W. J. Hall, Alex Cagan, et al. 2018. “Somatic Mutant Clones Colonize the Human Esophagus with Age.” Science 362 (6417): 911–17. https://doi.org/10.1126/science.aau3879.
Martincorena, Iñigo, Amit Roshan, Moritz Gerstung, Peter Ellis, Peter Van Loo, Stuart McLaren, David C. Wedge, et al. 2015. “High Burden and Pervasive Positive Selection of Somatic Mutations in Normal Human Skin.” Science 348 (6237): 880–86. https://doi.org/10.1126/science.aaa6806.
McGranahan, Nicholas, and Charles Swanton. 2015. “Biological and Therapeutic Impact of Intratumor Heterogeneity in Cancer Evolution.” Cancer Cell 27 (1): 15–26. https://doi.org/10.1016/j.ccell.2014.12.001.
———. 2017. “Clonal Heterogeneity and Tumor Evolution: Past, Present, and the Future.” Cell 168 (4): 613–28. https://doi.org/10.1016/j.cell.2017.01.018.
Nik-Zainal, Serena, Peter Van Loo, David C. Wedge, Ludmil B. Alexandrov, Christopher D. Greenman, King Wai Lau, Keiran Raine, et al. 2012. “The Life History of 21 Breast Cancers.” Cell 149 (5): 994–1007. https://doi.org/10.1016/j.cell.2012.04.023.
Priestley, Peter, Jonathan Baber, Martijn P. Lolkema, Neeltje Steeghs, Ewart de Bruijn, Charles Shale, Korneel Duyvesteyn, et al. 2019. “Pan-Cancer Whole-Genome Analyses of Metastatic Solid Tumours.” Nature 575 (7781): 210–16. https://doi.org/10.1038/s41586-019-1689-y.
Turajlic, Samra, Hang Xu, Kevin Litchfield, Andrew Rowan, Stuart Horswell, Tim Chambers, Tim O’Brien, et al. 2018. “Deterministic Evolutionary Trajectories Influence Primary Tumor Growth: TRACERx Renal.” Cell 173 (3): 595–610.e11. https://doi.org/10.1016/j.cell.2018.03.043.
Turnbull, Clare, Richard H. Scott, Ellen Thomas, Louise Jones, Nirupa Murugaesu, Freya Boardman Pretty, Dina Halai, et al. 2018. “The 100 000 Genomes Project: Bringing Whole Genome Sequencing to the NHS.” BMJ 361 (April): k1687. https://doi.org/10.1136/bmj.k1687.
Van Loo, Peter, Silje H. Nordgard, Ole Christian Lingjærde, Hege G. Russnes, Inga H. Rye, Wei Sun, Victor J. Weigman, et al. 2010. “Allele-Specific Copy Number Analysis of Tumors.” Proceedings of the National Academy of Sciences of the United States of America 107 (39): 16910–15. https://doi.org/10.1073/pnas.1009843107.
Watkins, Thomas B. K., Emilia L. Lim, Marina Petkovic, Sergi Elizalde, Nicolai J. Birkbak, Gareth A. Wilson, David A. Moore, et al. 2020. “Pervasive Chromosomal Instability and Karyotype Order in Tumour Evolution.” Nature 587 (7832): 126–32. https://doi.org/10.1038/s41586-020-2698-6.
Zaccaria, Simone, and Benjamin J. Raphael. 2020. “Accurate Quantification of Copy-Number Aberrations and Whole-Genome Duplications in Multi-Sample Tumor Sequencing Data.” Nature Communications 11 (1): 4301.
https://doi.org/10.1038/s41467-020-17967-y.

Data Availability

Multiregion colorectal cancer data are deposited in EGA under accession number EGAS00001003066. PCAWG calls are publicly available at the ICGC Data Portal (https://dcc.icgc.org/). CNAqc is implemented as an open source R package that is hosted at the GitHub space of the Caravagna Lab, https://caravagnalab.github.io/CNAqc/. The tool webpage contains RMarkdown tutorial vignettes to run a CNAqc analysis of a generic dataset, as well as documents that explain the visualisations and the parameterization of the execution. All analyses in this paper can be replicated following the vignettes.

Authors contribution
All authors conceived the method, which GC formalised and implemented. All authors analysed the data and wrote the manuscript.

Competing interests. The authors declare no competing interests.

Online methods

CNAqc supports two human genome references (GRCh38 and hg19), and the most common CNA profiles found in cancers:
● heterozygous diploid states (1:1) (see footnote 2);
● loss of heterozygosity (LOH) in monosomy (1:0) and copy-neutral (2:0) states;
● triploid (AAB or 2:1) or tetraploid (AABB or 2:2) states.

We make a simplifying assumption, whereby CNAs have been acquired in one step, starting from a simple heterozygous diploid state (the germline). For this reason, for tetraploid segments we only consider copy state 2:2, instead of 3:1 or 4:0. This allows us to make simpler computations. In practice, we avoid working with copy states for which the computation of CCFs is very difficult, and that are quite unlikely to be observed in real data. Also, we consider only clonal CNA segments.
While subclonal CNA segments are certainly important for cancer genomics, the calls that we seek to quality check are just the clonal CNA events; being the ones present in the majority of cancer cells, they have to be prioritised, and subclonal CNAs are only reliable for tumours with good clonal CNA calls.

CNAqc works primarily with Whole-Genome Sequencing (WGS) data. For exome data, the reduced exonic mutation burden can make it more difficult to work with the VAF spectrum. In general, the key determinant for detecting peaks in the VAF distribution is the number of mutations per copy state. For tumours exposed to strong exogenous mutagenic factors (e.g., smoking) or with a very high mutation rate (e.g., microsatellite unstable tumours), the number of exonic mutations could be high enough to use CNAqc.

Footnote 2. The notation 1:1 is sometimes analogously expressed as genotype AB, 1:0 as A, 2:1 as AAB and 2:2 as AABB.

Peak-detection QC

We consider a somatic mutation present in $m$ copies of the tumour genome, when the sample purity is $\pi$ and the segment ploidy is $p$. Note that $p$ can be computed by summing the number of copies of the minor and major alleles at the mutation locus (Figure 1). The key equations for the expected VAF of a clonal mutation and its CCF are presented in the Main Text. Here we discuss how peaks can be used to QC both tumour purity and CNA segments and, consequently, overall tumour ploidy.
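For reference, the two relations can be written in the following form. This is a reconstruction under stated assumptions rather than a copy of the Main Text: $v$ is taken to be the expected VAF, $\mathrm{CCF}$ the cancer cell fraction of the mutation, and normal cells are assumed diploid. Setting $\mathrm{CCF} = 1$ gives 25% VAF for a 1:1 segment at 50% purity and 50% at 100% purity.

```latex
v = \frac{m\,\pi\,\mathrm{CCF}}{2(1 - \pi) + p\,\pi},
\qquad
\mathrm{CCF} = \frac{v\,\bigl[(p - 2)\,\pi + 2\bigr]}{m\,\pi}
```

The two expressions are inverses of each other, since $2(1 - \pi) + p\pi = (p - 2)\pi + 2$.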
From a QC perspective, if we fix $m$ and $\pi$ and solve the equations for the VAF $v$ of a clonal mutation, we obtain

$$v = \frac{m\pi}{2(1 - \pi) + p\pi},$$

which means that if we know tumour purity and CNA, we expect a peak at VAF $v$, for a given value of $m$, in the data distribution (Figure 1a and 1b). For instance, for a 1:1 segment ($p = 2$), the expected VAF for a heterozygous clonal ($m = 1$) mutation is 25% for a 50%-purity tumour, and 50% for a 100%-purity tumour. Similarly, for a 2:2 genome ($p = 4$) of a tumour with 75% purity, the expected VAF for clonal mutations accrued before genome doubling, and therefore visible in two copies ($m = 2$), is ~43%, while for those accrued after genome doubling, and therefore present in a single copy ($m = 1$), we expect a ~21% VAF (Dentro, Wedge, and Van Loo 2017). CNAqc checks the data for peaks at these VAFs, with a tolerance $\epsilon > 0$. From the distance between the theoretical expectation and the estimator derived from data, we obtain an error metric for the calls.

CNAqc first performs peak detection from the input VAF with two separate methods:
1. Via a kernel density estimation with fixed bandwidth, which is used to determine a smooth density profile. Peaks are then estimated from the discretized smooth, using specialised R packages for peak detection and removing peaks with density below a parameterized cutoff.
2. Via a Binomial mixture from the BMix package (Caravagna et al. 2020; https://caravagn.github.io/BMix/); a peak is associated with each Binomial probability, for all mixture components.

Peaks are matched to the expected theoretical values based on their Euclidean distance. A theoretical peak can be matched either to the closest peak in the data, or to the rightmost peak of the frequency spectrum. This latter strategy works only if there are no miscalled CNAs. The first strategy (closest match) is the default CNAqc choice.
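The peak check described above can be illustrated with a minimal Python sketch. It is not the CNAqc implementation (CNAqc is an R package): the function names are hypothetical, the expected-VAF expression is the standard form $v = m\pi / (2(1-\pi) + p\pi)$ assumed for a clonal mutation, the detected peaks are made-up numbers, and the 0.03 tolerance is an illustrative default only.

```python
def expected_vaf(m, p, purity):
    """Expected VAF of a clonal mutation present in m copies, in a segment
    with total copy number p, assuming diploid normal cells."""
    return m * purity / (2 * (1 - purity) + p * purity)


def match_closest(expected_peaks, detected_peaks, tolerance=0.03):
    """Match every expected peak to the closest detected peak (the
    'closest match' strategy) and flag it PASS if within the tolerance."""
    matches = []
    for expected in expected_peaks:
        closest = min(detected_peaks, key=lambda d: abs(d - expected))
        passed = abs(closest - expected) <= tolerance
        matches.append((expected, closest, passed))
    return matches


if __name__ == "__main__":
    # Hypothetical 2:1 segment (p = 3) at 80% purity: two expected peaks,
    # one for m = 1 and one for m = 2.
    expected = [expected_vaf(m, p=3, purity=0.8) for m in (1, 2)]
    detected = [0.12, 0.29, 0.58]  # made-up peaks from a density estimate
    for e, d, ok in match_closest(expected, detected):
        print(f"expected {e:.3f}  matched {d:.3f}  {'PASS' if ok else 'FAIL'}")
```

The matching works on the absolute VAF distance, so a data peak that sits between two expected peaks is assigned to whichever is nearer; the rightmost-peak strategy would need a different selection rule.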
For every peak a QC value (PASS or FAIL) is determined based on some tolerance $\epsilon > 0$. The overall QC status of copy states with multiple peaks is the QC of the peak with most mutations underneath. The overall QC status for a sample with many copy states is determined by summing up the QC status of individual copy states, and weighting them by the number of mutations associated (majority rule).

CCF estimation

CNAqc can compute CCFs in two ways. One of the two uses the idea of the mixture highlighted in Figure 1c; the other is simpler and works better when data resolution is low, and the entropy of the mixture model would leave too many mutations unassigned.

For the mixture approach, we build a 2-component Binomial mixture from the theoretical expectations and the data. This implicitly assumes that peaks have been QCed first.
We constrain the success parameters to match the expected VAF, and use the proportion of mutations that appear underneath a peak as mixing proportions $\pi$ (in this mixture, $\pi$ denotes the mixing proportions rather than the sample purity). Then, from the latent variables $z$ of the model we compute the probability of assigning a mutation with VAF $x_n$ to cluster $c$, $p(z_{n,k} = c \mid \theta, \pi)$. From this information we obtain the entropy $H(z)$ of $z$, which is low for values that are assignable to only one cluster. Recall in this respect that the maximum entropy distribution is the uniform one, which arises when a mutation is equally likely to be in 1 or 2 copies, based on its VAF. We use a simple peak detection heuristic to find 2 points of change in $H(z)$; in between those values we cannot reliably assess $m$, i.e. assess whether the mutation is in single or double copy. For these mutations CNAqc leaves the CCF value as NA.

The alternative approach uses a simpler idea, still working on the expected theoretical VAF. Here, instead of fitting a mixture, we determine the midpoint $o$ between the two expected theoretical VAF peaks. The midpoint is computed by weighting each of the two peaks proportionally to the number of mutations that appear underneath each peak. The midpoint is a cut: values below $o$ are in single copy, values above in two. This procedure requires data with good sequencing coverage, and a good general quality.

When mutation multiplicities have been determined, CCF computation is trivial, and follows the formula presented in the Main Text. A QC PASS status is assigned to the
CCF values for a copy state if less than 10% of them (or any custom threshold) are unassigned. The overall sample is given a QC status based on a majority policy.

Genome fragmentation

Some recently identified patterns of somatic CNA changes can be attributed to the presence of highly fragmented tumour genomes, termed chromothripsis and chromoplexy, or localised hypermutation patterns, termed kataegis (Cortés-Ciriano et al. 2020). While these can be identified using dedicated bioinformatics tools, CNAqc offers a simple statistical test to detect the presence of over-fragmentation in a chromosome arm, a prerequisite that could point to the presence of such patterns.

The test works at the level of each chromosome arm (1p, 1q, 2p, 2q, etc.), and uses the length of each input CNA segment to assign a “long segment” or “short segment” status. This is determined by a cut parameter $\mu$ that is set, by default, to 20% (i.e., $\mu = 0.2$).

Then, a null hypothesis $H_0$ is used to compute a p-value. This is defined using a Binomial test based on $k$, the number of trials given by the total segment count in the arm, and $s$, the observed number of short segments. The Binomial distribution for $H_0$ is defined by $\mu$, and the p-value is the probability under the null of observing at least $s$ short segments, a one-tailed test for whether the observations are biased towards shorter segments. The test is adjusted for the family-wise error rate by Bonferroni correction, dividing the desired significance level $\alpha$ by the number of tests.

This test is applied to a subset of chromosome arms with a minimum number of segments, and that “jump” in ploidy by a minimum amount (empirical default values estimated from trial data). The arm-level jump is determined as the sum of the differences between the ploidy of consecutive DNA segments.
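The one-tailed Binomial test with Bonferroni correction can be sketched in a few lines of Python. This is an illustration under assumptions, not CNAqc code: the function names are hypothetical, the per-arm counts of short segments $s$ and total segments $k$ are taken as given inputs, and the significance level of 0.05 is an assumed value that the text does not specify.

```python
from math import comb


def binomial_upper_tail(s, k, mu):
    """One-tailed p-value P(X >= s) for X ~ Binomial(k, mu)."""
    return sum(comb(k, i) * mu**i * (1 - mu) ** (k - i) for i in range(s, k + 1))


def flag_overfragmented_arms(arm_counts, mu=0.2, alpha=0.05):
    """arm_counts maps an arm name to (s, k): short segments and total segments.

    Returns arm -> (p_value, significant), comparing every p-value against
    the Bonferroni-corrected level alpha / number_of_tests.
    """
    corrected_alpha = alpha / len(arm_counts)
    results = {}
    for arm, (s, k) in arm_counts.items():
        p_value = binomial_upper_tail(s, k, mu)
        results[arm] = (p_value, p_value < corrected_alpha)
    return results


if __name__ == "__main__":
    # Made-up counts: arm 1p has 9 short segments out of 10, arm 1q only 2.
    for arm, (p, sig) in flag_overfragmented_arms({"1p": (9, 10), "1q": (2, 10)}).items():
        print(arm, f"p = {p:.3g}", "over-fragmented" if sig else "not significant")
```

Dividing $\alpha$ by the number of tests, as done here, is equivalent to multiplying each p-value by the number of tests and comparing against the uncorrected $\alpha$.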
The segment-count and ploidy-jump covariates used by the test are similar to those used to infer CNA signatures from single-cell low-pass WGS (Macintyre et al. 2018).

Other features

CNAqc contains multiple functions to subset the data (e.g., select mutations that map only to certain copy states, subset CNAs with a given total ploidy, etc.), visualise the data (e.g., plot mutational burden by tumour genome) or smooth the input CNA segments.

Smoothing is an operation that can be carried out before testing for over-fragmentation. In CNAqc, smoothing merges two contiguous segments if they have exactly the same ploidy profile (i.e., the same numbers of copies for the major and minor alleles), and if they are at most a maximum distance apart (e.g., 1 megabase). This operation does not affect the ploidy profile of the calls, but reduces the number of breakpoints that would otherwise bias the Binomial over-fragmentation test towards significance.
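The smoothing operation can be sketched as a single pass over sorted segments. The Python below is an illustration and not the CNAqc implementation: the dictionary keys (`chr`, `start`, `end`, `major`, `minor`) and the function name are hypothetical, and segments are assumed to be sorted by chromosome and start position.

```python
def smooth_segments(segments, max_gap=1_000_000):
    """Merge contiguous segments with identical major/minor copy numbers
    when the gap between them is at most max_gap bases.

    Input segments are not modified; merged copies are returned.
    """
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        same_state = (
            last is not None
            and last["chr"] == seg["chr"]
            and (last["major"], last["minor"]) == (seg["major"], seg["minor"])
        )
        if same_state and seg["start"] - last["end"] <= max_gap:
            last["end"] = seg["end"]  # extend the previous segment
        else:
            merged.append(dict(seg))  # start a new segment (copy)
    return merged


if __name__ == "__main__":
    calls = [
        {"chr": "chr1", "start": 0, "end": 100, "major": 1, "minor": 1},
        {"chr": "chr1", "start": 150, "end": 300, "major": 1, "minor": 1},
        {"chr": "chr1", "start": 300, "end": 400, "major": 2, "minor": 1},
    ]
    for seg in smooth_segments(calls):
        print(seg)
```

Because only segments with identical major and minor copy numbers are merged, the copy state at any covered position is unchanged; only the number of breakpoints decreases.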
Main Text Figures

Figure 1. a. Theoretical VAF histogram for diploid 1:1 mutations in a tumour. A clonal heterozygous mutation has 50% VAF; all mutations are observed with some Binomial sequencing noise. The clonal mutations form a peak at 100% CCF, plus other features that characterise the tumour clonal composition (e.g., the tail). The expected theoretical VAF decreases if sample purity reduces. b. The case of a 2:1 tumour genome, where we expect 2 peaks in the VAF originating from mutations present in one (orange) or two copies (purple). The multiplicity of a mutation can phase whether it happened before or after the CNA. For 2:1 we expect peaks at 66% and 33% VAF, both from clonal mutations (100% CCF). c. Computing CCFs requires caution for mutations with different multiplicities; we support 2:0, 2:1 and 2:2 copy states in CNAqc, and offer two methods to compute CCFs. The one depicted is based on the entropy of a Binomial mixture. From the expected VAF peaks we construct a mixture density and use the entropy of its latent variables to capture uncertainty in the multiplicities. At the crossing of the components we cannot easily assign multiplicities, and therefore CCFs; the entropy peaks at the top of the uncertainty by definition. d. Heatmap expressing the relationship between copy states, mutation multiplicity and sample purity. The color reflects the expected VAF for the corresponding mutations, and can be used to QC both CNAs and purity estimates.

Figure 2. a. Genome-wide total clonal copy number segments for a PCAWG cancer sample with overall ploidy 2, and sample purity ~85%. The panel is composed of three illustrations. The bottom plot reports the copies of the major and minor alleles in each segment, and some genome areas are shaded. The central plot shows genome-wide somatic mutations with their depth of sequencing, and the top plot shows the total number of mappable mutations binned every megabase. b. Variant Allele Frequencies (VAFs) for the mutations that map to the input segments (note that these are all SNVs). c. Depth of sequencing (DP) for every SNV. d. Number of reads (NV) with the variant allele for every SNV. e. Cancer Cell Fractions (CCF) estimation for this sample, obtained from CNAqc.

Figure 3. a-d. Peak detection analysis assessing the quality of CNA segments (split by copy state), and tumour purity. The shaded gray area shows the input mutations, and the thin black profile is their kernel density estimate (KDE).
The black circles represent the peaks detected from the KDE, and the vertical dashed lines are the expected peaks, given the tumour purity. If the data peaks fall within the shaded area surrounding the vertical line, the estimates are consistent and the plot is therefore green (QC pass). For copy states with total copy number >2, multiple peaks are checked independently. In that case the overall QC status for the copy state is a linear combination of the results, weighted by the number of mutations assignable to each peak. ​e-h. Cancer Cell Fractions (CCF) estimation for each tumour genome, using the entropy method. Each plot shows both CCF, and the VAF from which mutation multiplicities are computed. In the rightmost panel we overlay the entropy profile computed by a 2-dimensional Binomial mixture. Areas within the red vertical dashed lines are those for which CNAqc cannot assign a confident CCF value. For copy states 1:0 and 1:1 the mutation multiplicity is fixed to 1 by definition. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.13.429885doi: bioRxiv preprint https://doi.org/10.1101/2021.02.13.429885 http://creativecommons.org/licenses/by-nc-nd/4.0/ Househam et al. A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. Figure 4. a. Circos plot for four possible whole-genome CNA segmentations determined by Sequenza with WGS data (~80x median coverage, purity 87%). The input sample is Set7_57, one of four multi-region biopsies for colorectal cancer patient Set7. The first run is with default Sequenza parameters. With CNAqc, we slightly adjust purity estimation and obtain a final run of the tool. 
We also one run forcing overall tumour ploidy to 4 (tetraploid), and one with maximum tumour purity 60%. ​b. Purity and ploidy estimation for the four Sequenza runs. Arrows show the adjustment proposed by CNAqc, the default and final runs are the only ones to pass QC. ​c. Final run with perfect results for Set7_57: copy number segments, depth of coverage per mutation and mutation density per megabase. ​d. ​Miscalled copy-neutral LOH segment, obtained by forcing a tetraploid solution in Sequenza. For a 2:0 segment with the estimated Sequenza .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.13.429885doi: bioRxiv preprint https://doi.org/10.1101/2021.02.13.429885 http://creativecommons.org/licenses/by-nc-nd/4.0/ Househam et al. A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. purity we expected peaks at ~60% and ~30% VAF, which cannot be matched. ​e. CNA calling with CNAqc and Sequenza for 4 WGS biopsies of the primary colorectal cancer Set7. Figure 5. a. Summary CNAqc pass or fail barplot for top-quality PCAWG samples 065 n = 1 across distinct tumour types. Failures for peaks are with a 3% error tolerance, and CCFs with 10% of SNVs not assignable, per copy state. ​b. ​Zoom peak analysis with a scatter showing, for .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. 
; https://doi.org/10.1101/2021.02.13.429885doi: bioRxiv preprint https://doi.org/10.1101/2021.02.13.429885 http://creativecommons.org/licenses/by-nc-nd/4.0/ Househam et al. A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. every tumour type, the total cases per tumour against the proportion of pass or fails; each dot size is proportional to the error measure from mismatched peaks. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.13.429885doi: bioRxiv preprint https://doi.org/10.1101/2021.02.13.429885 http://creativecommons.org/licenses/by-nc-nd/4.0/ Househam et al. A fully automated approach for quality control of cancer mutations in the era of high-resolution whole genome sequencing. Supplementary Figures Supplementary Figure S1. ​PCAWG sample with low mutational burden. Supplementary Figure S2. ​Sample​ ​Set7_55 (multi-region). Supplementary Figure S3. ​Sample​ ​Set7_59 (multi-region). Supplementary Figure S4. ​Sample​ ​Set7_62 (multi-region). Supplementary Figure S5. ​Sample​ ​Set6_42 (multi-region). Supplementary Figure S6. ​Sample​ ​Set6_44 (multi-region). Supplementary Figure S7. ​Sample​ ​Set6_45 (multi-region). Supplementary Figure S8. ​Sample​ ​Set6_46 (multi-region). Supplementary Figure S9. ​Sample​ ​Set6_47 (multi-region). Supplementary Figure S10. ​Sample​ ​Set6_48 (multi-region). Supplementary Figure S11. ​PCAWG sample with overstimated 100% purity. Supplementary Figure S12. ​PCAWG sample with true 99% purity. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. 
Supplementary Figure S1. Example PCAWG medulloblastoma sample with low mutational burden, which passes data QC with CNAqc. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). Note that this sample has only 76 SNVs in diploid tumour regions, a number comparable to what we observe in whole-exome assays. b,c. Peak analysis and CCF computation for diploid SNVs.

Supplementary Figure S2. Colorectal multi-region sample Set7_55 for patient Set7 (see also Main Text Figure 4). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S3. Colorectal multi-region sample Set7_59 for patient Set7 (see also Main Text Figure 4). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S4. Colorectal multi-region sample Set7_62 for patient Set7 (see also Main Text Figure 4). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S5. Colorectal multi-region sample Set6_42 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S6. Colorectal multi-region sample Set6_44 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S7. Colorectal multi-region sample Set6_45 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S8. Colorectal multi-region sample Set6_46 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S9. Colorectal multi-region sample Set6_47 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.
Supplementary Figure S10. Colorectal multi-region sample Set6_48 for patient Set6. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b,c. Peak analysis and CCF computation for the sample.

Supplementary Figure S11. Example PCAWG sample with purity of 100%. a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b. This sample has 75% of its SNVs in diploid tumour regions, where a small peak is detectable at the expected purity. The VAF clearly peaks at ~10%, possibly suggesting a purity of 20% or lower, rather than 100%. Further doubts about the current purity come from non-diploid regions, where all peaks are mismatched; for this sample CNAs called with a low-purity solution should be compared to the 100% purity solution. c. CCF computation for the sample. Notice that in triploid and tetraploid tumour genomes we do not find mutations present in 2 copies. If this were true, the tumour would not have acquired any SNV right before the CNA. Also, here we are not cross-checking QC results from peak detection; for instance, we could decide to use only mutations that map to PASS states (1:1, 2:2), and reject all others.

Supplementary Figure S12. Example PCAWG pancreatic adenocarcinoma with 99% purity (and 3 possible driver SNVs, 2 of them involving tumour suppressor genes in LOH regions). a. Data for the sample (genome-wide CNA segments, CCF and read counts distribution). b. This sample has 90% of its SNVs in diploid tumour regions, and the others in a variety of distinct CNA segments. From a peak analysis point of view, all the calls are validated. c. CCF values for this sample are also good.
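The reasoning used for Supplementary Figure S11 (a diploid VAF peak at ~10% suggesting a purity near 20%) and the peak-matching rule of Figures 3 and 5 can be illustrated with a short sketch. This is not the CNAqc code: the function names are hypothetical, the 3% tolerance is assumed to be expressed in VAF units, and the way per-peak results are combined (a mutation-weighted sum of +1 for a match and -1 for a mismatch, passing when positive) is only one possible reading of the "linear combination" described in the Figure 3 caption.

```python
def implied_purity_diploid(observed_peak):
    """Purity implied by a clonal heterozygous peak in a 1:1 segment.
    With a diploid normal, the expected VAF is purity / 2, so purity = 2 * VAF."""
    return min(1.0, 2 * observed_peak)


def peak_matches(observed, expected, tolerance=0.03):
    """True if a detected peak lies within the tolerance band around the
    expected peak (assumption: tolerance measured in VAF units)."""
    return abs(observed - expected) <= tolerance


def copy_state_qc(checks, tolerance=0.03):
    """Combine per-peak results for one copy state.
    `checks` is a list of (observed_peak, expected_peak, n_mutations).
    Hypothetical rule: weighted sum of +1 (match) / -1 (mismatch),
    weighted by the number of mutations assigned to each peak."""
    total = sum(n for _, _, n in checks)
    score = sum(
        (1 if peak_matches(obs, exp, tolerance) else -1) * n
        for obs, exp, n in checks
    ) / total
    return ("PASS" if score > 0 else "FAIL"), score


if __name__ == "__main__":
    # S11-like case: claimed purity 100% gives an expected diploid peak at 0.50,
    # but the data peak sits near 0.10.
    print(copy_state_qc([(0.10, 0.50, 1000)]))
    print("implied purity:", implied_purity_diploid(0.10))
    # 2:1 segment with two peaks checked independently (made-up counts).
    print(copy_state_qc([(0.34, 1 / 3, 800), (0.58, 2 / 3, 200)]))
```

In the last call the well-supported peak matches while the minor one does not, so the weighted score stays positive; a different weighting or threshold would change that outcome, which is why the rule is stated here as an assumption.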