key: cord-0851674-0beey8tv
authors: Marcolungo, L.; Beltrami, C.; Degli Esposti, C.; Lopatriello, G.; Piubelli, C.; Mori, A.; Pomari, E.; Deiana, M.; Scarso, S.; Bisoffi, Z.; Grosso, V.; Cosentino, E.; Maestri, S.; Lavezzari, D.; Iadarola, B.; Paterno, M.; Segala, E.; Giovannone, B.; Gallinaro, M.; Rossato, M.; Delledonne, M.
title: ACoRE: Accurate SARS-CoV-2 genome reconstruction for the characterization of intra-host and inter-host viral diversity in clinical samples and for the evaluation of re-infections
date: 2021-01-26
journal: nan
DOI: 10.1101/2021.01.22.21250285
sha: d1d0d57f1ee0246b448898f6b2484360a9df32cf
doc_id: 851674
cord_uid: 0beey8tv

We report Accurate SARS-CoV-2 genome Reconstruction (ACoRE), an amplicon-based viral genome sequencing workflow for the complete and accurate reconstruction of SARS-CoV-2 sequences from clinical samples, including suboptimal ones that would usually be excluded even if unique and irreplaceable. We demonstrated the utility of the approach by achieving complete genome reconstruction and the identification of false-positive variants in >170 clinical samples, thus avoiding the generation of inaccurate and/or incomplete sequences. Most importantly, ACoRE was crucial to identify the correct viral strain responsible for a relapse case, which would otherwise have been misclassified as a re-infection due to missing or incorrect variant identification by a standard workflow.

The reconstruction of complete and accurate genomic sequences to detect both SNVs and iSNVs is therefore necessary to produce reliable data for all these aims. In addition, the accumulation of meaningful data during pandemics requires the analysis of many samples, and the corresponding methods must therefore be cost-effective, straightforward and suitable for high multiplexing [24].
The protocols must also be sensitive enough to detect low viral titers but applicable over a wide dynamic range of virus concentrations to allow the analysis of clinical samples with different viral loads, ideally including samples from early and late infection stages, which usually show lower viral detection, or from re-infection/relapse cases [25, 26].

Among the many approaches available for SARS-CoV-2 whole-genome analysis, the amplicon-based

medRxiv preprint doi: https://doi.org/10.1101/2021.01.22.21250285; this version posted January 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

present in another, even when produced from the same cDNA. The concordance (Rc) in sequencing coverage was high for replicates of four samples (Rc ~0.99-1) but lower in sample S5 (Rc ~0.95), which had the lowest viral load (Figure 1B and Table S4), but there was no significant difference between replicates from the same or different cDNAs (p = 0.25, Wilcoxon test). Variations in coverage can affect genotyping accuracy, so we evaluated reproducibility in terms of genotypability by calculating the fraction of genomic positions where it is possible to call a genotype after aligning reads to the reference genome. The genotypability Rc was at or just below 1 in the other samples (Rc = 0.99-1), but lower in sample S5, which also showed the lowest sequencing coverage Rc (Figure 1C and Table S5). Reproducibility was similar between inter-cDNA replicates and intra-cDNA replicates (p > 0.99, Wilcoxon test). To assess how fluctuations in genotypability and coverage affect the final viral genome sequences, we generated a consensus sequence for each replicate.
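As an illustration of the two quantities discussed above, the following minimal Python sketch derives a consensus sequence and a genotypability fraction from per-position base counts. It is not the pipeline used in this study: the function names, the input layout (one dictionary of base counts per genomic position), the example counts and the default minimum depth of 8 reads are all assumptions made for the example.

```python
from collections import Counter


def consensus_base(counts, min_depth=8):
    """Return the most frequent base at one position, or 'N' if depth is too low."""
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # position cannot be genotyped under the assumed threshold
    return Counter(counts).most_common(1)[0][0]


def consensus_and_genotypability(pileup, min_depth=8):
    """pileup: list of {base: read_count} dicts, one per genomic position."""
    if not pileup:
        raise ValueError("pileup must contain at least one position")
    consensus = "".join(consensus_base(c, min_depth) for c in pileup)
    callable_positions = sum(1 for base in consensus if base != "N")
    return consensus, callable_positions / len(pileup)


# Hypothetical four-position pileup (values invented for illustration).
pileup = [
    {"A": 120, "G": 2},   # well covered
    {"C": 5},             # below the assumed depth threshold
    {"T": 40, "C": 39},   # covered, but with a near-even mixture
    {},                   # amplicon dropout: no reads at all
]
consensus, fraction = consensus_and_genotypability(pileup)
print(consensus, fraction)
```

In this toy input, two of the four positions fall below the depth threshold, so they are reported as ambiguous and lower the genotypability fraction accordingly.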
The reproducibility among consensus variants was optimal in the first four samples, but dropped to ~0.3 for sample S5 (Figure 1D and Table S6). Nevertheless, reproducibility was again similar between inter-cDNA replicates and intra-cDNA replicates (p > 0.99, Wilcoxon test).

The number of iSNVs (frequency >3%) varied significantly between technical replicates, and only a small fraction of the total number of iSNVs identified was shared by different replicates (Table S7). The Rc was suboptimal (<0.95) for all samples and steadily decreased as the Ct value increased (Figure 1E and Table S8), but there was no significant difference between replicates generated from the same or different cDNAs (p = 0.44, Wilcoxon test). In summary, consensus sequences and intra-host variants can be strongly affected by uneven amplicon representation and PCR errors (Figure 2), confirming the need to sequence at least two replicates to achieve an accurate characterization of the SARS-CoV-2 genome. However, the two amplifications can be generated from the same starting cDNA, thus reducing sample consumption and costs.

It is made available under a CC-BY-ND 4.0 International license.

Improvement of genome reconstruction by merging technical replicates

While addressing the reproducibility issues observed for both SNVs and iSNVs in samples with low viral loads, we also tested whether merging two or more technical replicates could improve coverage and genotypability. The rationale was the observation that amplicons with the lowest coverage varied across different replicates, and amplicons missing in one replicate could have a coverage >100× or >1000× in others (Figure S1).
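The rationale above can be made concrete with a small sketch. Assuming per-amplicon coverage values are available for each replicate, summing coverage across replicates shows how an amplicon that dropped out in one replicate can be rescued by another. The amplicon names, coverage values and the 20× dropout cutoff below are invented for illustration and are not taken from the study.

```python
def merge_coverage(replicates):
    """Sum per-amplicon coverage over a list of {amplicon: coverage} dicts."""
    merged = {}
    for rep in replicates:
        for amplicon, cov in rep.items():
            merged[amplicon] = merged.get(amplicon, 0) + cov
    return merged


def dropouts(coverage, min_cov=20):
    """Return the sorted amplicons whose coverage is below min_cov (assumed cutoff)."""
    return sorted(a for a, cov in coverage.items() if cov < min_cov)


# Hypothetical replicates in which a different amplicon fails each time.
rep1 = {"amp_01": 1500, "amp_02": 3, "amp_03": 800}
rep2 = {"amp_01": 0, "amp_02": 1200, "amp_03": 650}
merged = merge_coverage([rep1, rep2])
print(dropouts(rep1), dropouts(rep2), dropouts(merged))
```

The sketch simply adds coverage and does not model any read downsampling, so it illustrates the complementarity of replicates rather than the exact comparison performed in the analysis.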
All possible combinations of two replicates for each sample were merged and downsampled to 800,000 fragments (400,000 for each replicate) to obtain the same sequencing input data as the initial analysis based on a single replicate (Table S9). When considering the merged datasets rather than single-replicate data, the average coverage consistently increased in the sample with the highest Ct value (p < 0.0001, Mann Whitney U-test), confirming that merging two amplification replicates (intra-cDNA or inter-cDNA) could mitigate the technical variability in amplicon coverage (Figure 3A-C) as well as significantly (p < 0.0001, Mann Whitney U-test) enhance the genotypability (Figure 3B). Merging up to six replicates achieved a slight further improvement in both coverage and genotypability (Figure 3A-B), indicating that both properties can be maximized by analyzing replicates of samples with low viral loads. Indeed, merging all sequence data available for sample S5 (with the lowest reproducibility) increased coverage sufficiently to achieve >96.98% non-ambiguous bases in the consensus sequence (Figure

One drawback of the ARTIC protocol on the Illumina platform is the need for 250PE sequencing to cover the full length of the amplicons (400 bp). This type of sequencing is currently available only for MiSeq and NovaSeq6000 SP flow cells, increasing the cost per sample and reducing the sample throughput. We therefore generated shorter libraries using the NexteraFlex approach and tested the use of alternative flow cells (NextSeq500/550 and NovaSeq6000 S1) and sequencing mode (150PE) on
the 30 samples originally tested using the KAPA library (Figure 1A). Despite skipping the laborious input DNA and library quantification steps before sequencing, the variability in the number of fragments analyzed per sample was lower (CV = 22.5%) than with the full-amplicon approach (CV = 38.3%) described above (Figure 4A). The sequencing data were mapped to the reference genome (Table S10) and compared to the 250PE dataset (KAPA library) normalized to the same average mapped coverage as the 150PE dataset (NexteraFlex library) (Table S11). Sequencing coverage was evenly distributed along the amplicons even when the NexteraFlex protocol was used, because the partial overlap of ARTIC amplicons compensated for the expected loss of sequence representation at the amplicon ends due to tagmentation (Figure 4B). The sequencing of fragmented amplicons had no adverse impact on genome coverage and genotypability, which were significantly higher than with full-length amplicon sequencing (p < 0.001 and p = 0.024, respectively, Friedman test; Figure 4C-D). Despite the lower coverage, similar results were observed with 100PE sequencing simulated after trimming the 150PE dataset (Figure 4C-D). The fragmented-amplicon approach was therefore advantageous for multiple aspects of SARS-CoV-2 sequencing, by increasing coverage, genotypability and throughput (allowing higher multiplexing) while reducing sequencing costs and eliminating unnecessary protocol steps such as DNA quantification after PCR and library quantification before pooling.

Although the NexteraFlex protocol saves on costs, this is offset by the requirement for multiple sequencing replicates from the same sample to improve genome coverage.
We therefore compared the effect of sequencing a library generated from two replicates (each amplified from 5 µL of cDNA) and a standard library prepared from a single amplification generated from double the amount of cDNA

consensus sequences based on single-cDNA analysis, but these were efficiently removed by considering the concordance between replicates (Table S13).

The identification of SARS-CoV-2 genetic variants at different time points can reveal whether recurrent infections are relapses caused by the same strain or independent infections with a different strain. We therefore evaluated our optimized workflow in a case study of relapse/re-infection involving a 48-year-old female patient who was hospitalized with mild COVID-19 symptoms following a positive nasopharyngeal swab on 4/3/2020, discharged with no symptoms on 11/3/2020 followed by two consecutive negative swab tests, but readmitted with mild COVID-19 symptoms 12 days later. During the second hospital stay, the nasopharyngeal swab test results fluctuated, and the patient was finally discharged on 21/4/2020 with no symptoms and two consecutive negative molecular tests. Three swab samples (one from the first and two from the second hospitalization period) were sequenced to identify the viral strain responsible for infection (Table 1).
All samples were sequenced in duplicate or quadruplicate (Table S14), and consensus variants were called in order to identify the viral strains. Depending on the replicate, some consensus variants identified in the first hospitalization period were missing or could not be genotyped in the second hospitalization period, leading to the hypothesis that different strains could be responsible for each infection (Table 1). In contrast, when merging sequencing replicates, the same variants were identified in all three samples (Table 1) and a very high-frequency (99.95%) false-positive variant could be identified at position 12890 (Table S13). Based on this analysis, we concluded that the same viral strain was responsible for both the first and second infection, and that the latter should therefore not be classed as a re-infection.

The positions of high-frequency variants (>75%) are shown in the consensus sequence of a specimen collected during the first hospitalization. For each of these positions, the genotypes identified in the samples collected during the second hospitalization are also shown. Genotypes are reported for each sequencing replicate independently or after merging all replicates from the same sample (merged). Positions that could not be genotyped are indicated with a dash.

Protocol optimization for simplicity, flexibility, throughput and cost-efficiency

Amplicon-based sequencing (originally called PrimalSeq) is the most sensitive and widely used protocol for SARS-CoV-2 whole-genome analysis from clinical isolates, but its disadvantages include uneven amplicon coverage and poor accuracy when the viral load is low [23].
We addressed these limitations by improving the accuracy and completeness of sequencing, as well as the cost-efficiency and throughput, thus achieving the highly reliable analysis of SARS-CoV-2 genomes. This benchmarking analysis established a robust workflow, ACoRE, that allowed the complete and accurate characterization of SARS-CoV-2 genomes in 170 clinical samples, including a subset (42%) with very low viral titers (Ct ≥ 30). We were also able to properly categorize an infection-relapse case study.

sequencing data from two or more replicates as a simple solution to enhance coverage and genotypability, achieving a more homogeneous representation of the viral genome and rescuing the suboptimal samples. The random amplification observed in low-titer samples most likely reflects the low sample complexity rather than poor assay sensitivity or performance. Accordingly, the sampled RNA and corresponding cDNA fragments before amplification are unlikely to represent the complete genome, based on our observation that the coverage achieved by sequencing two amplification replicates (each from 5 µL of cDNA) was similar to that achieved with a single amplification starting from double the amount of cDNA (10 µL). Therefore, to optimize genome reconstruction, a single large cDNA batch should be amplified in several parallel reactions, using as much sample volume as possible to increase complexity.
The multiple PCR products can then be pooled before library preparation and sequenced as a single sample to avoid increasing costs.

Analyzing at least two amplification reactions not only improves coverage and genotypability but is also required to achieve accurate variant calling (SNVs and iSNVs). It is well established that the analysis of viral iSNVs down to 3% frequency requires the generation of multiple replicates to distinguish true-positive iSNVs from low-frequency PCR or sequencing errors [23]. In contrast, the generation of consensus sequences for the analysis of SNVs in epidemiological studies requires the identification of the most frequent nucleotide at each position and is typically based on single replicates [12, 45]. However, we discovered that consensus sequences also contain frequent SNV errors (>12% in our cohort) and that the comparison of technical replicates is required to ensure accuracy. This was not confined to low-titer samples (Ct > 30) but also included some samples with moderate viral loads (Ct = 25-30), potentially leading to the submission of inaccurate consensus sequences to public repositories such as GISAID. These false-positive variants probably arose due to PCR errors because they were not found in other amplification replicates (either from the same or different cDNA).
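One way to picture the replicate comparison argued for above is a simple concordance filter. The sketch below is an illustration under stated assumptions, not the ACoRE implementation: variants are represented as (position, ref, alt) tuples mapped to allele frequencies, a variant is retained only if it reaches the minimum frequency in at least two replicates, and the 3% default is borrowed from the iSNV frequency mentioned in the text. The positions and frequencies are invented.

```python
def concordant_variants(replicates, min_freq=0.03, min_replicates=2):
    """Split variants into those supported by enough replicates and the rest.

    replicates: list of {(pos, ref, alt): allele_frequency} dicts.
    """
    support = {}
    for rep in replicates:
        for variant, freq in rep.items():
            if freq >= min_freq:
                support[variant] = support.get(variant, 0) + 1
    kept = sorted(v for v, n in support.items() if n >= min_replicates)
    discordant = sorted(v for v, n in support.items() if n < min_replicates)
    return kept, discordant


# Hypothetical calls from two amplification replicates of one sample.
rep_a = {(1000, "C", "T"): 0.99, (2000, "G", "A"): 0.97, (3000, "A", "G"): 0.05}
rep_b = {(1000, "C", "T"): 0.98, (3000, "A", "G"): 0.01}
kept, discordant = concordant_variants([rep_a, rep_b])
print(kept)
print(discordant)
```

Note that the variant at the invented position 2000 is flagged despite its high frequency in one replicate, which mirrors the observation that consensus-level false positives can appear in a single amplification.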
The assessment of re-infections

Reconstruction of highly accurate sequences from sub-optimal samples was crucial to identify the correct viral strain responsible for a second hospitalization case, which was hypothesized to be a re-

for whom "…genomic analysis of SARS-CoV-2 showed genetically significant differences between each variant associated with each instance of infection…" suggesting that "…the patient was infected by SARS-CoV-2 on two separate occasions by a genetically distinct virus…" [45]. The viral load of the swab samples analyzed in that study was very low (Ct > 35) and the analysis was based on a protocol with 14-22 PCR cycles and no amplification replicates, so potential false-positive variants and/or regions with low genotypability may have influenced the results. We reanalyzed the data and noted that two of the four variants specifically associated with the first infection had insufficient sequencing coverage to achieve confident variant calling in the sample from the second infection (Table S15). In particular, our bioinformatic pipeline revealed that position 539 was covered by only five reads, so a genotype could not be properly called, whereas variant 16741G→T (supported by 10 reads) was only just above the genotypability threshold of 8 (Table S15). These positions were genotyped using the bioinformatic pipeline utilized by the authors because the limit was set to five reads. Furthermore, variant 4113C→T showed a frequency of 67.82% in the first infection, suggesting that two viral strains were already
present: a predominant strain carrying the identified variant and a less-abundant strain lacking the variant that became prevalent in the second infection (Table S15)

Cuore Don Calabria Hospital, qualified for SARS-CoV-2 molecular diagnosis by the regional reference laboratory (Department of Microbiology, University Hospital of Padua). After collection, swabs were stored at 4 °C for a maximum of 48 h and analysed by the routine molecular diagnostic method (RT-qPCR as indicated in the following paragraph). The remaining swab material was then aliquoted and preserved at -80 °C. The study was approved by the competent Ethical Committee for Clinical Research of Verona and Rovigo Provinces (Prot N° 39528/2020).

The routine RT-qPCR protocol was based on a recommended test (emergency use authorization)

by 1 min on ice to anneal the primers. We then added 4 µL of 5× SSIV buffer, 1 µL of 100 mM DTT, 1 µL of 40 U/µL RNaseOUT, 1 µL of 200 U/µL SSIV enzyme (Thermo Fisher Scientific) and 6 µL nuclease-free water (total reaction volume = 20 µL) and heated the reaction to 23 °C for 10 min, 52 °C for 10 min and 80 °C for 10 min. We generated two or three cDNAs from each sample (depending on the experiment), each of which was amplified 2-3 times using the ARTIC protocol.
In each case, we mixed 2.5 or 5 µL cDNA (depending on the experiment) with 3.7 µL of 10 µM primer pools A and B from the

cycles of PCR. We cleaned up 10-µL aliquots of each amplified library using a 1:1 ratio of sample purification beads (Illumina) and eluted the purified library in 20 µL of resuspension buffer (Illumina). The resulting libraries were analyzed on the 4150 TapeStation System (average size 335-369

(intra-cDNA concordance) and triplets of replicates generated from different cDNAs (inter-cDNA concordance) as shown in Table S2.

The non-parametric Wilcoxon signed rank test and the Mann Whitney U-test were used to compare matched pairs and non-matched data, respectively. The non-parametric Friedman test was used to compare multiple paired groups. Significance of pairing was confirmed by calculating Spearman's rho. We used GraphPad Prism 6.0 (GraphPad Software, San Diego, CA, USA) for all statistical analysis, with a significance threshold of p < 0.05.

The raw reads dataset supporting the conclusions of this article is available at the NCBI SRA repository under BioProject ID PRJNA690890.
The authors declare that they have no competing interests.

The work performed at IRCCS Sacro Cuore Don Calabria Hospital was supported by the Italian Ministry of Health "Fondi Ricerca corrente-L1P5".
COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)
A new coronavirus associated with human respiratory disease in China
Spike mutation D614G alters SARS-CoV-2 fitness
Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
The Impact of Mutations in SARS-CoV-2 Spike on Viral
Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo
Evolutionary dynamics of SARS-CoV-2 nucleocapsid protein and its consequences
Genomic epidemiology reveals transmission patterns and dynamics of SARS-CoV-2 in Aotearoa New Zealand
Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling
Spread of SARS-CoV-2 in the Icelandic population
Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands
Clinical and biological insights from viral genome sequencing
Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science (80-)
Stochastic processes constrain the within and between host evolution of influenza virus
Real-time digital pathogen surveillance - the time is now
Transmission, and Evolution during Seven Months in Sierra Leone
Intra-host sequence variability in human papillomavirus
Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore. bioRxiv Prepr Serv Biol
SARS-CoV-2 detection, viral load and infectivity over the course of an infection
Clinical and virological data of the first cases of COVID-19 in Europe: a case series
Highly sensitive and full-genome interrogation of SARS-CoV-2 using multiplexed PCR enrichment followed by next-generation sequencing
SARS-CoV-2 genomes recovered by long amplicon tiling multiplex approach using nanopore sequencing and applicable to other sequencing platforms
Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR
Amplicon Sequencing Identifies Community Spread and Ongoing Evolution of SARS-CoV-2 in the
Performance of targeted library preparation solutions for SARS-CoV-2 whole genome analysis
Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples
Stability of SARS-CoV-2 phylogenies
Quality control of low-frequency variants in SARS-CoV-2 genomes
Geographic and Genomic Distribution of SARS-CoV-2 Mutations. Front Microbiol
Limited SARS-CoV-2 diversity within hosts and following passage in cell culture
Naturally occurring SARS-CoV-2 gene deletions close to the spike S1/S2 cleavage site in the viral quasispecies of COVID19 patients
A benchmarking study of SARS-CoV-2 whole-genome sequencing protocols using COVID-19 patient samples
Oligonucleotide capture sequencing of the SARS-CoV-2 genome and subgenomic fragments from COVID-19 individuals
Genomic Epidemiology of SARS-CoV-2 in Guangdong Province
Whole genome sequencing of sars-cov-2: Adapting illumina protocols for quick and accurate outbreak investigation during a pandemic
Guidelines for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples
Reinfection of COVID-19 after 3 months with a distinct and more aggressive clinical presentation: Case report
Genomic evidence for reinfection with SARS-CoV-2: a case study
CoV-2 PCR cycle threshold values provide practical insight into overall and target-Specific sensitivity among symptomatic patients
Recurrence of Positive SARS-CoV-2 Results in Patients Recovered From COVID-19
Prolonged shedding of severe acute respiratory syndrome coronavirus 2 in patients with COVID-19
Persistent Detection and Infectious Potential of SARS-CoV-2 Virus in Clinical Specimens from COVID-19 Patients
European Centre for Disease Prevention and Control. Reinfection with SARS-CoV: considerations for public health response: ECDC
COVID-19 reinfection: are we ready for winter? EBioMedicine
Intra-host evolution during SARS-CoV-2 persistent infection
Phylogenetic network analysis of SARS-CoV-2 genomes
54. CDC 2019-Novel Coronavirus (2019-nCoV) Real-Time RT-PCR Diagnostic Panel
Fast processing of NGS alignment formats
Trimmomatic: A flexible trimmer for Illumina sequence data
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
BEDTools: A flexible suite of utilities for comparing genomic features
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data
Minimap2: Pairwise alignment for nucleotide sequences
VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing

Figure 1. Comparison of intra-cDNA and inter-cDNA replicates of SARS-CoV-2 genome amplification and sequencing. (A) Schematic diagram showing the five clinical samples obtained from COVID-19 patients, their RT-qPCR Ct values and the experimental workflow. For each sample, we generated three independent cDNAs and each cDNA was amplified in duplicate using the ARTIC nCoV
Amplicons used as the input for library preparation were sequenced in 250PE mode on the
The bar charts show mean concordance rates (± standard deviations) for (B) genome coverage, (C) genotypability, (D) consensus variants and (E) iSNV between amplification replicates generated from different cDNAs (inter-cDNA) or the same cDNA

Figure 2. Coverage and variant calling between intra-cDNA and inter-cDNA replicates
Green bars represent the amplicons generated using the ARTIC original primer set, and orange bars represent the amplicons generated using the alternative V3 primers
Viewer (IGV) visualization of four representative sequencing replicates of sample S5 in the region