key: cord-0969978-xuhssu19
authors: Marcolungo, Luca; Beltrami, Cristina; Esposti, Chiara Degli; Lopatriello, Giulia; Piubelli, Chiara; Mori, Antonio; Pomari, Elena; Deiana, Michela; Scarso, Salvatore; Bisoffi, Zeno; Grosso, Valentina; Cosentino, Emanuela; Maestri, Simone; Lavezzari, Denise; Iadarola, Barbara; Paterno, Marta; Segala, Elena; Giovannone, Barbara; Gallinaro, Martina; Rossato, Marzia; Delledonne, Massimo
title: ACoRE: Accurate SARS-CoV-2 genome reconstruction for the characterization of intra-host and inter-host viral diversity in clinical samples and for the evaluation of re-infections
date: 2021-04-08
journal: Genomics
DOI: 10.1016/j.ygeno.2021.04.008
sha: 896e2b15845eab1fdb8d71ba4be12d08505bca3b
doc_id: 969978
cord_uid: xuhssu19

Sequencing the SARS-CoV-2 genome from clinical samples can be challenging, especially in specimens with low viral titer. Here we report Accurate SARS-CoV-2 genome Reconstruction (ACoRE), an amplicon-based viral genome sequencing workflow for the complete and accurate reconstruction of SARS-CoV-2 sequences from clinical samples, including suboptimal ones that would usually be excluded even if unique and irreplaceable. The protocol was optimized to improve flexibility and the combination of technical replicates was established as the central strategy to achieve accurate analysis of low-titer/suboptimal samples. We demonstrated the utility of the approach by achieving complete genome reconstruction and the identification of false-positive variants in >170 clinical samples, thus avoiding the generation of inaccurate and/or incomplete sequences. Most importantly, ACoRE was crucial to identify the correct viral strain responsible of a relapse case, that would be otherwise mis-classified as a re-infection due to missing or incorrect variant identification by a standard workflow.

The coronavirus disease 2019 (COVID- 19) pandemic has thus far resulted in the infection of more than 84 million people, causing at least 1.8 million deaths (Johns Hopkins University, 1/1/2021)[1] .

The agent responsible for COVID-19 is a β-coronavirus known as severe acute respiratory syndromeassociated coronavirus 2 (SARS-CoV-2) with a compact single-stranded RNA genome of 29, 903 nucleotides. The first SARS-CoV-2 genome sequence was published soon after the initial outbreak [2] , and more than 260,000 complete genome sequences have subsequently been deposited in the GISAID database [3] . The phylogenetic analysis of genomic sequences provides a valuable tool to track viral diversity during the course of a pandemic and to identify the emergence of prevalent strains characterized by lineage-specific single nucleotide variants (SNVs), such as the D614G variant in the SARS-CoV-2 spike protein gene (23403:A→G) [4] [5] [6] . As the virus propagates in human-tohuman transmission, changes in the reference genome sequence must be recorded to monitor correlations between viral genotype and disease communicability, manifestation and severity [4, [7] [8] [9] . The combination of genomic analysis and epidemiological data can also reliably determine the extent of SARS-CoV-2 transmission in different nations [10] [11] [12] and thus facilitates early decisionmaking to control local transmission [13] . Finally, mutations that may be relevant to the fitness or antigenic profile of the virus can be identified to ensure the efficacy of vaccines and immunotherapeutic interventions in the clinic [4, 14] .

Consensus variations reflect the analysis of virus sequences that differ between patients, but the analysis of intra-individual single nucleotide variations (iSNVs) is also important because it helps us to understand more about virus-host interactions, as previously demonstrated for Ebola, Zika, influenza and HIV [15] [16] [17] [18] [19] . The analysis of iSNVs during the COVID-19 pandemic may also provide data about the potential of SARS-CoV-2 for immunological escape and resistance to therapy, as well as on the sensitivity of molecular diagnostic assays [20] [21] [22] . However, the identification of iSNVs in clinical samples can be challenging because current protocols often feature enrichment and J o u r n a l P r e -p r o o f Journal Pre-proof amplification steps that introduce technical errors indistinguishable from true biological variants [23] .

The reconstruction of complete and accurate genomic sequences to detect both SNVs and iSNVs is therefore necessary to produce reliable data, at all these aims. In addition, the accumulation of meaningful data during pandemics requires the analysis of many samples, and the corresponding methods must therefore be cost-effective, straightforward and suitable for high-multiplexing [24] .

The protocols must also be sensitive enough to detect low viral titers but applicable over a wide dynamic range of virus concentrations to allow the analysis of clinical samples with different viral loads, ideally including samples from early and late infection stages, that usually show a lower viral detection, or from re-infection/relapse cases [25, 26] .

Among the many approaches available for SARS-CoV-2 whole-genome analysis, the amplicon-based sequencing method developed by the ARTIC Network [27] is currently the most widely used [13, 24, [28] [29] [30] [31] [32] . Based on the PrimalSeq protocol originally developed for Zika virus [23, 33] , the ARTIC Network designed a set of 98 tiled amplicons in two PCR pools for the targeted whole-genome amplification of SARS-CoV-2 [27] . This approach is simple and highly sensitive, but it suffers from technical biases leading to uneven genome coverage, thus reducing the completeness and accuracy of genome sequencing, especially for the identification iSNVs in samples with low viral titers [34] [35] [36] .

Sequencing technical replicates of multiple cDNAs generated from the same sample has been proposed as a mitigation strategy to identify iSNVs more reliably [23] . However, whereas ampliconbased sequencing has been widely used for the analysis of low-frequency variants [20] [21] [22] 37, 38] only a few studies thus far have evaluated the confidence of such calls and have implemented the sequencing of cDNA replicates to ensure accuracy [23] . False positives have also been reported among high-frequency variants supported by good sequencing depth, indicating that the risks of inaccurate sequencing are not limited to suboptimal samples [39] .

To avoid the generation of incomplete genomic sequences typically associated with poor genome coverage [40] [41] [42] , the sequencing of samples with fewer than 1000 virus copies per RT-qPCR reaction (Ct > 30) is currently discouraged [23, 43] . However, the strict implementation of such recommendations would lead to the exclusion of many clinical samples, which are often unavoidably collected or stored under suboptimal conditions. Since specimens with these features may be unique and irreplaceable -central to the investigation conducted-, numerous studies therefore report sequencing data from samples with (very) low viral titers (Ct > 30) despite this advice [26, 44, 45] . To address these challenges, we set out to develop an optimized workflow, ACoRE (Accurate SARS-CoV-2 genome Reconstruction), for the reliable reconstruction of complete and accurate SARS-CoV-2 genomes from clinical samples with a broad range of Ct values, aiming to improve the flexibility, accuracy and throughout of amplicon-based sequencing.

178 Nasopharyngeal swabs (eSwab, Copan, Italy) were obtained from 172 COVID-19 patients diagnosed at the Department of Infectious, Tropical Diseases and Microbiology of the IRCCS Sacro Cuore Don Calabria Hospital, qualified for SARS-CoV-2 molecular diagnosis by the regional reference laboratory (Department of Microbiology, University Hospital of Padua). After collection, swabs were stored at 4 °C for a maximum of 48 h, analysed by the molecular diagnostic method described in the following paragraph) and subsequently stored at -80 °C. The study was approved by the competent Ethical Committee for Clinical Research of Verona and Rovigo Provinces (Prot N° 39528/2020).

The routine RT-qPCR protocol was based on the WHO guidelines [46] . Briefly, RNA was extracted from 200L of swabs using the automated Microlab Nimbus workstation (Hamilton, Reno, NV, USA) coupled to a Kingfisher Presto system (Thermo Fisher Scientific, Waltham, MA, USA) or using the J o u r n a l P r e -p r o o f Journal Pre-proof MagnaMax Viral/Pathogen extraction kit (Thermo Fisher Scientific) according to the manufacturer's instructions. RT-qPCR was carried out using the CDC 2019-nCoV rRT-PCR Diagnostic Panel assay and protocol [47] , targeting the nucleocapsid protein gene regions N1 and N2 (with the human RNAse P gene as the internal control) on a CFX96 Touch system (Bio-Rad Laboratories, Hercules, California, USA). The amplification cycle threshold (Ct) was determined using CFX Maestro (Bio-Rad), setting a baseline threshold at 200 relative fluorescence units (RFU). A standard curve from 5 to 500 genome copies per reaction was performed with serial dilution of the CDC control plasmid containing the CDC qPCR Assays target regions (2019-nCoV_N_Positive Control, Integrated DNA Technologies, Coralville, Iowa, USA) containing the complete nucleocapsid gene of SARS-CoV-2 (Table S1).

Samples with Ct values of 15-18 were diluted 10-fold as suggested by the ARTIC Network [27] . RNA 

Libraries were prepared from 50 ng of virus amplicons using the KAPA Hyper prep kit and unique Resulting libraries were analyzed on the 4150 TapeStation System (average size 335-369 bp), pooled and quantified using the Qubit dsDNA BR Assay kit. Libraries were sequenced on a Novaseq 6000 device (Illumina) using a SP flow cell in 100PE mode, or on a NextSeq500 (Illumina) in 150PE mode.

Full-length amplicon sequencing data were randomly downsampled using seqtk sample v1.3 (https://github.com/lh3/seqtk). To compare sequencing data from the full-length and fragmented amplicons, KAPA library reads were downsampled at the same mean mapped coverage as the corresponding Illumina library replicates using sambamba v0. 6.7 [48] . To simulate sequencing using J o u r n a l P r e -p r o o f Journal Pre-proof 100PE reads, data from the fragmented amplicon libraries were trimmed using a custom script. All sequencing datasets were trimmed for quality and adapters were removed using Trimmomatic v0.39 [49] with the following parameters: ILLUMINACLIP:adapters_file:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:20. Filtered reads were aligned to the SARS-CoV-2 reference genome (GenBank ID: MN908947.3) using BWA MEM v0.7.17 [50] with default parameters and the relative alignment file was converted to BAM file using SAMtools v1.9 [51] . For the fragmented libraries, duplicate reads were identified and discarded using Picard v2. 21 .1 (http://broadinstitute.github.io/picard).

Subsequently, iVar v1.2.2 trim [23] was used to remove ARTIC v3 primer sequences from the BAM files. For the fragmented libraries, the -e parameter was used to include reads without primers.

Finally, overlapping portions of reads were clipped using fgbio ClipBam v1.1.0 (https://github.com/fulcrumgenomics/fgbio) with the following parameters: --clip-overlapping-reads -c Hard. Coverage and genotypability statistics were calculated from the BAM files using bedtools genomecov v2. 19 .1 [52] and GATK CallableLoci v3.8 [53] , respectively. Raw genomic sequencing data were deposited in NCBI GenBank (BioProject no PRJNA690890).

A pileup was calculated for each position in the BAM file of each replicate using the SAMtools v1.9 mpileup option with parameters -aa -A -d 0 -Q 0. The resulting files were used as input for iVar consensus v1.2.2 [23] to generate consensus sequences, considering those positions covered by at least three reads (parameters: -t 0 -m 3). The most abundant nucleotide for each position was reported in the consensus sequence, whereas positions covered by fewer than three reads or reporting an equal proportion of nucleotides were represented by the ambiguous character N.

To call variants present in the consensus sequences (consensus variants), sequences were aligned to the SARS-CoV-2 reference genome using Minimap v2.17 [54] and the alignment file was converted to the BAM format using SAMtools v1.9. Consensus variants were then called using bcftools call v1. 10 For each sample, inter-replicate discordant variants were identified by iSNV variant calling after merging sequencing data from all replicates, considering only genotyped positions. A discordant variant was defined as a variant called in one replicate, whereas the same position in the other replicate reported the reference allele.

The concordance rate (R c ) between replicates samples was calculated as follows:

Nc represents (i) the number of shared variants (consensus variants or iSNVs) excluding positions that could not be genotyped in at least one replicate, or (ii) the number of shared genotypable bases, excluding positions marked N in at least one replicate, or (iii) the number of shared amplicons

J o u r n a l P r e -p r o o f with coverage higher than three reads in all replicates. N1 and N2 represent the total number of iSNVs, consensus variants, genotypable bases or covered amplicons detected in each of the two samples in the analysis. R c was calculated by comparing couples of replicates generated from the same cDNA (intra-cDNA concordance) and triplets of replicates generated from different cDNAs (inter-cDNA concordance) as shown in Table S2 .

The non-parametric Wilcoxon signed rank test and the Mann Whitney U-test were used to compare matched pairs and non-matched data, respectively. The non-parametric Friedman test was used to compare multiple paired groups. Significance of pairing was confirmed by calculating Spearman's rho. We used GraphPad Prism 6.0 (GraphPad Software, San Diego, CA, USA) for all statistical analysis, with a significance threshold of p < 0.05.

The original Primalseq protocol stipulates two independent reverse transcriptions per sample and the subsequent amplification of the separate cDNAs in order to reduce technical errors. In this study, we initially tested replicate amplifications from the same cDNA to investigate whether this alternative approach could affect the reproducibility in the generation of SARS-CoV-2 consensus sequences and in the identification of intra-host variants. At this aim, we selected five COVID-19positive swabs representing viral loads ranging from ~500 to ~2 million, based on Ct values (determined by RT-qPCR) ranging from 15.07 to 28.5 (Table S1) . For each sample, we generated three cDNAs and carried out two separate amplifications, resulting in six replicates per starting RNA ( Figure 1A ). An individual KAPA library was prepared from each replicate, and sequencing in 250PE mode produced an average of 1 million fragments. The dataset was normalized to ~800,000 J o u r n a l P r e -p r o o f fragments per library, corresponding to ~7800× coverage per sample after alignment to the SARS-CoV-2 reference genome ( Table S3) .

The sequencing coverage was variable across the different amplicons of the ARTIC panel, particularly in samples with a higher Ct value (Figure 2 and Figure S1 ). Interestingly, most amplicons showed either high (>500×) or very low (≤10×) to zero coverage, and amplicons absent in one replicate could be present in another, even when produced from the same cDNA. The concordance (R c ) in sequencing coverage was high for replicates of four samples (R c ~0.99-1) but lower in sample S5 (R c ~0.95) with the lowest viral load ( Figure 1B and Table S4 ), but there was no significant difference between replicates from the same or different cDNAs (p = 0.25, Wilcoxon test). Variations in coverage can affect genotyping accuracy, so we evaluated reproducibility in terms of genotypability by calculating the fraction of genomic positions where it is possible to call a genotype after aligning reads to the reference genome. The genotypability R c was optimal or slightly lower than 1 in all samples (R c = 0.99-1), but lower in sample S5, which also showed the lowest sequencing coverage R c ( Figure 1C and Table S5 ). Reproducibility was similar between inter-cDNA replicates and intra-cDNA replicates (p > 0.99, Wilcoxon test). To assess how fluctuations in genotypability and coverage affect the final viral genome sequences, we generated a consensus sequence for each replicate. The reproducibility among consensus variants was optimal in the first four samples, but consistently dropped to ~0.3 for sample S5 ( Figure 1D and Table S6 ). Nevertheless, reproducibility was again similar between inter-cDNA replicates and intra-cDNA replicates (p > 0.99, Wilcoxon test).

The number of iSNVs (frequency >3%) varied significantly between technical replicates, with a small fraction of iSNVs shared by different replicates compared to the total number of iSNVs identified ( Table S7 ). The R c was suboptimal (<0.95) for all samples and steadily decreased as the Ct value increased ( Figure 1E and Table S8 ), but there was no significant difference between replicates generated from the same or different cDNAs (p = 0.44, Wilcoxon test). In summary, consensus sequences and intra-host variants can be strongly affected by uneven amplicon representation and J o u r n a l P r e -p r o o f PCR errors (Figure 2) confirming the need to sequence at least two replicates to achieve an accurate characterization of the SARS-CoV-2 genome. However, the two amplifications can be generated from the same starting cDNA, thus reducing sample consumption and costs.

While addressing the reproducibility issues observed for both SNVs and iSNVs in samples with low viral loads, we also tested whether merging two or more technical replicates could improve coverage and genotypability. The rationale was the observation that amplicons with the lowest coverage varied across different replicates, and amplicons missing in one replicate could have a coverage >100× or >1000× in others ( Figure S1 ). All possible combinations of two replicates for each sample were merged and downsampled to 800,000 fragments (400,000 for each replicate) to obtain the same sequencing input data as the initial analysis based on a single replicate (Table S9) . When considering the merged datasets rather than single-replicate data, the average coverage consistently increased in the sample with the highest Ct value (p < 0.0001, Mann Whitney U-test), confirming that merging two amplification replicates (intra-cDNA or inter-cDNA) could mitigate the technical variability in amplicon coverage (Figure 3A -C) as well as significantly (p < 0.0001, Mann Whitney U-test) enhance the genotypability (Figure 3B) . Merging up to six replicates achieved a slight further improvement in both coverage and genotypability ( Figure 3A-B) , indicating that both properties can be maximized by analyzing replicates of samples with low viral loads. Indeed, merging all sequence data available for sample S5 (with the lowest reproducibility) increased coverage sufficiently to achieve >96.98% non-ambiguous bases in the consensus sequence ( Figure 3C-D) , which is the GISAID threshold for classifying a SARS-CoV-2 genome as complete [3] . J o u r n a l P r e -p r o o f

One drawback of the ARTIC protocol on the Illumina platform is the need for 250PE sequencing to cover the full length of the amplicons (400 bp). This type of sequencing is currently available only for MiSeq and NovaSeq6000 SP flow cells, increasing the cost per sample and reducing the sample throughput. We therefore generated shorter libraries using a tagmentase-based approach (Illumina DNA Prep, former NexteraFlex) and tested the use of alterative flow cells (NextSeq500/550 and NovaSeq6000 S1) and sequencing mode (150PE) on the 30 samples originally tested using the KAPA library ( Figure 1A) . Despite skipping the laborious input DNA and library quantification steps before sequencing, the variability in the number of fragments analyzed per sample was lower (CV = 22.5%) than the full-amplicon approach (CV = 38.3%) described above ( Figure 4A) . The sequencing data were mapped to the reference genome ( (Table S11) . Sequencing coverage was evenly distributed along the amplicons even when the Illumina protocol was used, because the partial overlap of ARTIC amplicons compensated for the expected loss of sequence representation at the amplicon ends due to tagmentation (Figure 4B) . The sequencing of fragmented amplicons had no adverse impact on genome coverage and genotypability, which were significantly higher compared to the full-length amplicon sequencing (p < 0.001 and p = 0.024, respectively, Friedman test; Figure 4C -D). Despite the lower coverage, similar results were observed with 100PE sequencing simulated after trimming the 150PE dataset (Figure 4C-D) . The fragmented-amplicon approach was therefore advantageous for multiple aspects of SARS-CoV-2 sequencing, by increasing coverage, genotypability and throughput (allowing higher multiplexing) while reducing sequencing costs and eliminating unnecessary protocol steps such as DNA quantification after PCR and library quantification before pooling.

Although the Illumina DNA Prep protocol saves on costs, this is offset by the requirement for multiple sequencing replicates from the same sample to improve genome coverage. We therefore compared the effect of sequencing a library generated from two replicates (each amplified from 5 µL 

Next we applied the optimized workflow to a set of 170 clinical samples representing a wide range of viral loads, with Ct values in the range 15-40 ( Figure S3) . Each sample was amplified in duplicate or quadruplicate starting from 10 µL cDNA, and 100PE sequencing was carried on a NovaSeq6000 SP flow cell using Illumina libraries, generating an average of ~2.8 million fragments per replicate (Table   S12 ). After pooling data from the replicates, ~75% of the samples showed both coverage and genotypability >96.98% (Figure 5A- (Table S13) .

The identification of SARS-CoV-2 genetic variants at different time points can reveal whether recurrent infections are relapses caused by the same strain or independent infections with a different strain. We therefore evaluated our optimized workflow in a case-study of relapse/reinfection involving a 48-year-old female patient who was hospitalized with mild COVID-19 symptoms following a positive nasopharyngeal swab on 4/3/2020, discharged with no symptoms on 11/3/2020 followed by two consecutive negative swab tests, but readmitted with mild COVID-19 symptoms 12 days later. During the second hospital stay, the nasopharyngeal swab test results fluctuated, and the patient was finally discharged on 21/4/2020 with no symptoms, and two consecutive negative molecular tests. Three swab samples (one from the first and two from the second hospitalization period) were sequenced to identify the viral strain responsible for infection ( Table 1) . All samples were sequenced in duplicate or quadruplicate (Table S14) , and consensus variants were called in order to identify the viral strains. Depending on the replicate, some consensus variants identified in the first hospitalization period were missing or could not be genotyped in the second hospitalization period, leading to the hypothesis that different strains could be responsible for each infection ( Table   1 ). In contrast, when merging sequencing replicates, the same variants were identified in all three samples (Table 1 ) and a very high-frequency (99.95%) false-positive variant could be identified at position 12890 (Table S13) . Based on this analysis, we concluded that the same viral strain was responsible of both the first and second infection, and that the latter should therefore not be classified as a re-infection.

Amplicon-based sequencing (originally called PrimalSeq) is the most sensitive and widely-used protocol for SARS-CoV-2 whole-genome analysis from clinical isolates, but its disadvantages include uneven amplicon coverage and poor accuracy when the viral load is low [23] . We addressed these 

Clinical specimens with low viral loads reduce the accuracy of variant calling and the completeness of genome reconstruction, both of which are inversely correlated with the quality and quantity of starting material [23, 30, 43] . Current guidelines for viral genotyping recommend a lower limit of 1000 virus copies per reaction [23, 43] but this would rule out a large proportion of clinical samples, including ~53% of the samples in our cohort. A Ct value of ~25 was identified as the median for virus detection in symptomatic patients, with a consistent proportion of samples (15-25%) falling above Ct 30 [25, 56] . Low viral loads are often found in patients with prolonged COVID-19 infection [57] [58] [59] , and five of six reported cases of potential re-infection involved samples with Ct values >30 [60] , but whole-genome sequencing is nevertheless recommended to differentiate between relapse and new infections caused by a different SARS-CoV-2 variants [60, 61] . The ability to sequence SARS-CoV-2 genomes in low-titer samples is therefore necessary to track infections and correlate different strains with disease communicability, manifestation and severity.

Increasing the depth of sequencing has been proposed as a strategy to achieve complete genome reconstruction in low-titer samples, but this does not overcome limitations caused by missing amplicons [43] . Similarly, improvement in ARTIC primer design and compatibility (currently version 3) can also ameliorate genome coverage, but again cannot make up for missing amplicons [24, 30] .

We found that only a few specific amplicons were reproducibly suboptimal (64, 70 and 91) whereas most showed coverage variations limited to particular samples or replicates. We therefore merged the sequencing data from two or more replicates as a simple solution to enhance coverage and genotypability, achieving a more homogeneous representation of the viral genome and rescuing the suboptimal samples. The random amplification observed in low-titer samples most likely reflects the low sample complexity rather than poor assay sensitivity or performance. Accordingly, the sampled RNA and corresponding cDNA fragments before amplification are unlikely to represent the complete genome based on our observation that the coverage achieved by sequencing two amplification replicates (each from 5 µL of cDNA) was similar to that achieved with a single amplification starting from double the amount of cDNA (10 µL). Therefore, to optimize genome reconstruction, a single J o u r n a l P r e -p r o o f large cDNA batch should be amplified in several parallel reactions, using as much sample volume as possible to increase complexity. The multiple PCR products can then be pooled before library preparation and sequenced as a single sample to avoid increasing costs.

It must be noted that low viral loads are not linearly correlated to poor sequencing results, as also some samples with Ct>30 showed complete genome reconstruction even when considering only one replicate. Therefore, beside sample complexity and concentration, other factors could play a role, as for example the integrity of initial RNA samples or the presence of contaminants, whose effect may be more evident on low concentrated samples. Since these factors would similarly impact the completeness of genome reconstruction as low titers, the ACoRE workflow could provide an experimental solution also for highly degraded samples or specimens containing inhibitors.

As well as improving coverage and genotypability, at least two amplification reactions must be analyzed to achieve accurate variant calling (SNVs and iSNVs). It is well established that the analysis of viral iSNVs down to 3% frequency requires the generation of multiple replicates to distinguish true-positive iSNVs from low-frequency PCR or sequencing errors [23] . In contrast, the generation of consensus sequences for the analysis of SNVs in epidemiological studies requires the identification of the most-frequent nucleotide at each position and is typically based on single replicates [12, 45] .

However, we discovered that consensus sequences also contain frequent SNV errors (>12% in our cohort) and the comparison of technical replicates is required to ensure accuracy. This was not confined to low-titer samples (Ct > 30) but also included some samples with moderate viral loads (Ct = 25-30) potentially leading to the submission of inaccurate consensus sequences to public repositories such as GISAID. These false-positive variants probably arose due to PCR errors because they were not found in other amplification replicates (either from the same or different cDNA).

However, studies reporting SARS-CoV-2 consensus sequences thus far have not included the analysis of technical replicates, even in the case of low-titer samples (Ct > 30) [26, 62] . The accuracy of SARS-CoV-2 consensus sequences deposited in GSAID has been called into question for documented J o u r n a l P r e -p r o o f sequences with putative errors or a significant number of variants in one particular submission (singletons) [35] and the use of stringent filters and bioinformatic tools has been proposed as a solution [62, 63] . Instead, with ACoRE we propose the use of replicates as a simple experimental solution to avoid the generation of incorrect consensus sequences prior to database submission.

Since similar errors and amplification biases have been reported to limit the analysis also of other viral genomes, such as HIV, Influenza or Zika virus [23, 64, 65] , we could predict that the benefits of the ACoRE approach are not limited to SARS-CoV-2, but may be extended to the NGS-based analysis of viral infections in general.

Reconstruction of highly accurate sequences from sub-optimal samples was crucial to identify the correct viral strain responsible of a second hospitalization case, that was hypothesized to be a reinfection. A standard workflow would have missed or included incorrect variants in support of such hypothesis, while ACoRE properly recognized that the different time-point samples contained the same viral strain.

Another interesting example, that would certainly benefit of ACoRE, comes from a publication that reported the first individual in North America to have symptomatic reinfection with SARS-CoV-2 [26] , for whom "…genomic analysis of SARS-CoV-2 showed genetically significant differences between each variant associated with each instance of infection…" suggesting that "…the patient was infected by SARS-CoV-2 on two separate occasions by a genetically distinct virus…" [45] . The viral load of the swab samples analyzed in that study was very low (Ct > 35) based on 14-22 PCR cycles-protocol without amplification replicates, therefore potential false-positive variants and/or regions with low genotypability may have influenced the results. We reanalyzed the data and noted that two of the four variants specifically associated with the first infection had insufficient sequencing coverage to achieve confident variant calling in the sample from the second infection (Table S15 ). In particular, J o u r n a l P r e -p r o o f our bioinformatic pipeline revealed that position 539 was covered by only five reads, thus a genotype could not be properly called; while variant 16741G→T (supported by 10 reads) was only just above the genotypability threshold of 8 (Table S15) . These positions were genotyped using the bioinformatic pipeline utilized by the authors because the limit was set to five reads. Furthermore, variant 4113C→T showed frequency of 67.82% in the first infection, suggesting that two viral strains were already present: a predominant strain carrying the identified variant and a less-abundant strain lacking the variant that became prevalent in the second infection (Table S15) 

The raw sequencing data supporting the conclusions of this article are available at the NCBI SRA repository under BioProject ID PRJNA690890. Table S12 .

J o u r n a l P r e -p r o o f Table 1 . High-frequency variants identified in the COVID-19 relapse case study. The positions of high-frequency variants (>75%) are shown in the consensus sequence of a specimen collected during the first hospitalization. For each of these positions, the genotypes identified in the samples collected during the second hospitalization are also shown. Genotypes are reported for each sequencing replicate independently or after merging all replicates from the same sample (merged).

Positions that could not be genotyped are indicated with a dash.

J o u r n a l P r e -p r o o f

A new coronavirus associated with human respiratory disease in China

Spike mutation D614G alters SARS-CoV-2 fitness

Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2

The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity

Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo

Evolutionary dynamics of SARS-CoV-2 nucleocapsid protein and its consequences

Genomic epidemiology reveals transmission patterns and dynamics of SARS-CoV-2 in Aotearoa New Zealand. medRxiv

Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling

Spread of SARS-CoV-2 in the Icelandic population

Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands

Clinical and biological insights from viral genome sequencing

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science (80-)

Stochastic processes constrain the within and between host evolution of influenza virus

Real-time digital pathogen surveillance -the time is now

Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone

Intra-host sequence variability in human papillomavirus

SARS-CoV-2 exhibits intra-host genomic plasticity and low-frequency polymorphic quasispecies

Genomic diversity of SARS-CoV-2 in COVID-19 patients

Characterization of SARS-CoV-2 viral diversity within and across hosts

An ampliconbased sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore

SARS-CoV-2 detection, viral load and infectivity over the course of an infection

Clinical and virological data of the first cases of COVID-19 in Europe: a case series

Highly sensitive and full-genome interrogation of SARS-CoV-2 using multiplexed PCR enrichment followed by next-generation sequencing

SARS-CoV-2 genomes recovered by long amplicon tiling multiplex approach using nanopore sequencing and applicable to other sequencing platforms

Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR

High-Density Amplicon Sequencing Identifies Community Spread and Ongoing Evolution of SARS-CoV-2 in the Southern United States

Performance of targeted library preparation solutions for SARS-CoV-2 whole genome analysis

Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples

Stability of SARS-CoV-2 phylogenies

Quality control of low-frequency variants in SARS-CoV-2 genomes

Geographic and Genomic Distribution of SARS-CoV-2 Mutations

Limited SARS-CoV-2 diversity within hosts and following passage in cell culture

Naturally occurring SARS-CoV-2 gene deletions close to the spike S1/S2 cleavage site in the viral quasispecies of COVID19 patients

A benchmarking study of SARS-CoV-2 whole-genome sequencing protocols using COVID-19 patient samples

Oligonucleotide capture sequencing of the SARS-CoV-2 genome and subgenomic fragments from COVID-19 individuals

Genomic Epidemiology of SARS-CoV-2 in Guangdong Province

Whole genome sequencing of sars-cov-2: Adapting illumina protocols for quick and accurate outbreak investigation during a pandemic

Guidelines for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples

Reinfection of COVID-19 after 3 months with a distinct and more aggressive clinical presentation: Case report

Genomic evidence for reinfection with SARS-CoV-2: a case study

Molecular assays to diagnose COVID-19: Summary table of available protocols

Trimmomatic: A flexible trimmer for Illumina sequence data

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

BEDTools: A flexible suite of utilities for comparing genomic features

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data

Minimap2: Pairwise alignment for nucleotide sequences

VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing

Distribution of SARS-CoV-2 PCR cycle threshold values provide practical insight into overall and target-Specific sensitivity among symptomatic patients

Report: Recurrence of Positive SARS-CoV-2 Results in Patients Recovered From COVID-19

Prolonged shedding of severe acute respiratory syndrome coronavirus 2 in patients with COVID-19

Persistent Detection and Infectious Potential of SARS-CoV-2 Virus in Clinical Specimens from COVID-19 Patients

European Centre for Disease Prevention and Control. Reinfection with SARS-CoV: considerations for public health response: ECDC

COVID-19 reinfection: are we ready for winter? EBioMedicine

Intra-host evolution during SARS-CoV-2 persistent infection

Phylogenetic network analysis of SARS-CoV-2 genomes

On the effective depth of viral sequence data

Error rates, PCR recombination, and sampling depth in HIV-1 whole genome deep sequencing

HIGHLIGHTS 1. Incomplete and inaccurate SARS-CoV-2 sequences can be generated from low

We gratefully acknowledge the Centro Piattaforme Tecnologiche (CPT) for granting access to the genomic facility of University of Verona. .

The work performed at IRCCS Sacro Cuore Don Calabria Hospital was supported by the Italian Ministry of Health "Fondi Ricerca corrente-L1P5".

The authors declare that they have no competing interests J o u r n a l P r e -p r o o f