key: cord-0969873-e5wcw9vk authors: Tonkin-Hill, Gerry; Martincorena, Inigo; Amato, Roberto; Lawson, Andrew R J; Gerstung, Moritz; Johnston, Ian; Jackson, David K; Park, Naomi R; Lensing, Stefanie V; Quail, Michael A; Gonçalves, Sónia; Ariani, Cristina; Chapman, Michael Spencer; Hamilton, William L; Meredith, Luke W; Hall, Grant; Jahun, Aminu S; Chaudhry, Yasmin; Hosmillo, Myra; Pinckert, Malte L; Georgana, Iliana; Yakovleva, Anna; Caller, Laura G; Caddy, Sarah L; Feltwell, Theresa; Khokhar, Fahad A; Houldcroft, Charlotte J; Curran, Martin D; Parmar, Surendra; Alderton, Alex; Nelson, Rachel; Harrison, Ewan; Sillitoe, John; Bentley, Stephen D; Barrett, Jeffrey C; Torok, M. Estee; Goodfellow, Ian G; Langford, Cordelia; Kwiatkowski, Dominic title: Patterns of within-host genetic diversity in SARS-CoV-2 date: 2020-12-24 journal: bioRxiv DOI: 10.1101/2020.12.23.424229 sha: 34a9a4ea1ca832815dd8ba8f2f0fe73ff115ebc3 doc_id: 969873 cord_uid: e5wcw9vk Monitoring the spread of SARS-CoV-2 and reconstructing transmission chains has become a major public health focus for many governments around the world. The modest mutation rate and rapid transmission of SARS-CoV-2 prevents the reconstruction of transmission chains from consensus genome sequences, but within-host genetic diversity could theoretically help identify close contacts. Here we describe the patterns of within-host diversity in 1,181 SARS-CoV-2 samples sequenced to high depth in duplicate. 95% of samples show within-host mutations at detectable allele frequencies. Analyses of the mutational spectra revealed strong strand asymmetries suggestive of damage or RNA editing of the plus strand, rather than replication errors, dominating the accumulation of mutations during the SARS-CoV-2 pandemic. Within and between host diversity show strong purifying selection, particularly against nonsense mutations. Recurrent within-host mutations, many of which coincide with known phylogenetic homoplasies, display a spectrum and patterns of purifying selection more suggestive of mutational hotspots than recombination or convergent evolution. While allele frequencies suggest that most samples result from infection by a single lineage, we identify multiple putative examples of co-infection. Integrating these results into an epidemiological inference framework, we find that while sharing of within-host variants between samples could help the reconstruction of transmission chains, mutational hotspots and rare cases of superinfection can confound these analyses. The SARS-CoV-2 pandemic has caused global disruption and more than one million deaths (1) . Genomic analysis has yielded important insights into the origins and spread of the pandemic, and has been an integral part of efforts to monitor viral transmission in the UK, with over 100,000 viral genomes sequenced as of December 2020 by the COVID-19 Genomics UK Consortium (COG-UK) (2) . For purposes of genomic epidemiology, a consensus genome sequence is derived from each sample but deep sequencing data invariably reveal some level of within-host variation (3) and minor alleles are commonly filtered out prior to phylogenetic analysis (4) . It has been suggested that SARS-CoV-2, like SARS-CoV-1 evolves within an infected host as a quasispecies, with many mutations (within-host variants) arising which may be beneficial for the virus (3, 5, 6) . Although there have been several previous reports on the within-host diversity of SARS-CoV-2 (3, 7, 8) a number of key questions remain to be resolved. Examples of underexplored questions include understanding the extent of sequencing artefacts among within-host variants, the mutational processes dominating SARS-CoV-2 evolution, the action of selection on within-host variants and the extent to which superinfection or co-transmission of multiple lineages confound within-host diversity analyses (9) . Better understanding of these questions can shed light on the evolution of SARS-CoV-2 and could inform efforts to reconstruct transmission chains with genomic data. To address these questions, we performed Illumina deep sequencing of over a thousand SARS-CoV-2 samples collected in March and April 2020 in the East of England. Two libraries were sequenced for each sample with separate reverse transcription (RT), PCR amplification and library preparation steps in order to evaluate the quality and reproducibility of within-host variant calls. To develop reliable methods for analysing within-host variation in the context of ongoing genomic epidemiological studies, we used the ARTIC protocol which is a common method used for SARS-CoV-2 genome sequencing by many labs around the world (10) . Analyses of the data provided insights into the extent of within-host diversity, the patterns of mutagenesis and the extent of selection, and suggested that amplification biases, hypermutable sites The mutational spectra reveal strong strand asymmetries. Analysis of the mutational spectrum can yield insights into the dominant mutational mechanisms underlying the evolution of SARS-CoV-2 during the pandemic. Consistent with previous reports (13) , the mutational spectrum of within-host variant closely resembles that of major variants ( Figure 2 ). The spectrum shows two striking features: a dominance of C>U and G>U changes with weak extended sequence context, and a large asymmetry between the plus and minus strands, inferred from the high C>U/G>A and G>U/C>A ratios when mutations are mapped to the reference (plus) strand. C>U mutations account for 47% of all within-host point mutations compared to 5.9% for G>A mutations (C>U mutations in the minus strand), and G>U mutations for 14.5% of mutations compared to 2.2% of C>A mutations (Methods). Normalised for sequence composition, the plus/minus strand ratios of the rates of C>U and G>U are 9.9-fold and 8.2-fold, respectively. These asymmetries appear difficult to explain if mutations were direct replication errors. Since any given viral RNA molecule has undergone the same number of plus-to-minus and minus-to-plus replication steps, replication errors are only expected to cause these asymmetries if both steps have very different error rates. For example, if the polymerase had a high C>U error rate in both strands, we would expect a symmetric number of C>U and G>A mutations in the plus strand, as C>Us would be introduced at equal rates when copying the plus or the minus strands ( Figure 2E ). Within a cell, the fact that minus templates are copied many times could theoretically lead to a viral population with fewer G>A mutations but at higher frequencies (14, 15) . However, the fact that the same asymmetries are observed for major variants, which typically represent the fixation of a single genotype, makes this an unlikely factor. Thus, unless the error rates of the RNA-dependent RNA-polymerase in SARS-CoV-2 are very different on both strands, which may be possible given its multisubunit structure (16) , replication errors seem unlikely to explain the observed asymmetries. The strong strand asymmetries observed may be more consistent with RNA damage or RNA editing of the plus strand dominating the accumulation of mutations in SARS-CoV-2. The plus strand is the infectious genome, exported from the replication organelles, where minus molecules reside, into the cytoplasm (17), translated, packaged into particles and transmitted between cells and hosts. The plus strand is also present within cells in much larger numbers than the minus strand (14) . Thus, the plus strand may be expected to accumulate higher rates of damage or editing, which would manifest as strand asymmetries. If we accept this hypothesis, the dominance of C>U over G>A and G>U over C>A, when mapped to the plus strand, suggests that the dominant forms of RNA editing or damage are C>U and G>U. RNA-editing enzymes in human cells are able to mutagenise single-stranded DNA and RNA molecules and are known anti-viral mechanisms (18, 19) . Two families of RNA-editing enzymes in particular have been speculated to contribute to the mutational spectrum of SARS-CoV-2, APOBEC cytosine deaminase enzymes causing cytosine to uracil transitions and ADAR enzymes causing adenosine-to-inosine changes (A>G/U>C mutations) (19) . While we see a high rate of C>U changes in the SARS-CoV-2 spectra, we see much lower rates of A>G/U>C mutations than previously suggested (20) . While the mutational spectrum induced by all APOBEC enzymes is not fully understood, the activity of the betterunderstood APOBEC3A and 3B enzymes in human cancers is distinctly characterised by a strong sequence context, lead-ing to C>T and C>G changes almost solely at TpC sites (21) . This contrasts with the weak context dependence observed in the SARS-CoV-2 spectrum. While RNA-editing enzymes may contribute to SARS-CoV-2 mutagenesis, direct damage of cytosine and guanine bases could also be consistent with the observed spectrum. For example, spontaneous cytosine deamination would cause C>U transitions while oxidation of guanine bases could explain G>U transversions (22) . Both forms of damage are common mutagenic processes in human cells (23) . Having described the mutational spectrum, we can also derive approximate estimates of the average number of withinhost mutations per viral genome in a sample. Since the al-D R A F T lele frequency of a mutation represents the fraction of viral RNA molecules in a sample carrying a mutation, we can estimate the mean within-host mutation load by summing the allele frequencies of all observed within-host variants in a sample (12) . Importantly, this is an approximation as it assumes that each RNA molecule sequenced, originated from the entire viral genome which is not always the case given the abundance of viral sgRNAs (24, 25) . These estimates are also likely to be a conservative lower-bound as they only include mutations at detectable allele fractions in both replicates. Across samples, we found a mean within-host mutation load per viral genome of 0.72 mutations (median 0.37) ( Figure 1D ). Phylogenetic studies estimate a mutation rate in SARS-CoV-2 of 0.001 mutations/bp/year (26), or 0.082 mutations/genome/day. Thus, the observed within-host mutation load would be consistent with the expected acquisition of mutations during a relatively short span of several days. To investigate the possible accumulation of de novo mutations during the course of an infection, we studied 43 individuals for whom we had multiple samples collected at different timepoints ( Figure 3A , Supplementary Figure 4 ). Overall the number of within-host variants tended to increase over time and this trend was significant using a Poisson mixed model to control for host-specific effects (p=0.007; Figure 3B ). To put this finding in context, there was considerable variation in the observed number of within-host variants among samples from the same individual, even if taken on the same day, possibly as a result of bottlenecks caused by the different sampling methods which included sputum, swabs and bronchoalveolar lavage (Supplementary Figure 5) . High variability between longitudinal samples has also been observed in six out of nine hospitalised patients in Austria (13) . Using data from nine patients with longitudinal samples, the authors found that the divergence in variant frequencies between serial samples from the same patient was on average greater than that between donor-recipient transmission pairs. The authors observed multiple instances of the fixation of a variant over the course of an infection. In contrast, we observed no change in the consensus genome sequence in any of the individuals from whom multiple samples were collected. It is possible that the accumulation of within-host mutations could in future be used to estimate time since infection. However, determining if a consistent signal could be observed given the extensive variation observed between repeated samples on the same day would require analysis of time-series data in a larger number of individuals with high quality epidemiological data. Within-host variants display strong purifying selection against nonsense mutations. To study the extent of selection acting on within-host variants and major variants, we calculated dN/dS ratios using the dNdScv package (Methods). Most commonly-used software to calculate dN/dS use simple substitution models (often using a single transitiontransversion ratio), which can lead to considerable biases and false signals of selection under neutrality (28, 29) . dNdScv uses a Poisson framework allowing for complex substitution models including context dependence, strand asymmetry, non-equilibrium of substitutions and estimation of dN/dS ratios for missense and nonsense mutations separately. This model is particularly suitable for datasets with low mutation density in lowly-or non-recombining populations, as it is the case in SARS-CoV-2 genomic data (28) . As expected, dN/dS ratios for major variants are under clear purifying selection ( Figure 4A ), with particularly strong selection against nonsense mutations. Within-host variants reaching moderately high allele frequencies (VAF>10%) display similarly strong purifying selection against missense and nonsense mutations as major variants ( Figure 4A ), while within-host variants at low allele frequencies appear to be under more relaxed purifying selection, as it may be expected. Purifying selection against nonsense mutations can also be observed at the level of allele frequencies ( Figure 4B ). Overall, the similarity of dN/dS ratios for major and moderate-VAF within-host variants suggests that selection within hosts, rather than during transmission, may explain a considerable fraction of the extent of purifying selection observed in consensus sequences. Many recurrently mutated sites appear to represent mutational hotspots. Some sites across the SARS-CoV-2 genome appear to have mutated independently multiple times, resulting in homoplasic sites across the viral phylogeny (30) . Some of these sites have also been reported to recurrently appear as within-host variants, although it remains unclear to what extent these events represent recurrent sequencing errors, contamination between samples, coinfection of a sample by multiple lineages, recombination, mutational hotspots or convergent positive selection (3, 13, 30) . Coinfection by multiple lineages could possibly arise from superinfection, where an individual acquires infection from multiple sources, or by co-transmission of multiple lineages from host to host following an episode of superinfection. Figure 4C represents the distribution of recurrent within-host variants across the genome. 190 sites are observed as withinhost variants in at least 5 (0.4%) of our samples. Sequencing or PCR errors are unlikely to contribute substantially to the recurrent variants observed in our dataset thanks to the use of replicates and the ShearwaterML algorithm. One mechanism by which the same within-host variant can be observed across multiple closely-related samples is transmission of withinhost variants between contacts, when a population of virions is transmitted between hosts. Under this scenario, sharing of within-host variants will be expected between samples with identical major variants, as the preservation of a within-host variant is incompatible with the simultaneous fixation of new major variants (the within-host allele would either be purged or hitchhike and become fixed). Instead, we find that most recurrent within-host variants occur across lineages, independently of the genomic background of major variants present in a sample ( Figure 4E ). This pattern is suggestive of recurrent mutation, although it could also be consistent with complex histories of superinfection and recombination. If we accept the hypothesis of recurrent mutation, there remains the question of whether this is caused by mutational hotspots or convergent positive selection. We observed that 25% (51/205) of recurrent (n ≥ 5) within-host variants are predicted to cause synonymous mutations and estimates of the dN/dS ratio corrected for the trinucleotide sequence context indicate that, as a group, these recurrent variants are under some purifying selection ( Figure 4A ). This suggests that most recurrent within-host variants are likely to represent hypermutable sites rather than convergent positive selection. Their mutational spectrum suggests an enrichment for C>U changes, particularly at sites preceded by a pyrimidine, but the mechanisms behind their apparent hypermutability remain unclear (Supplementary Figure 6) . Still, this analysis does not rule out the possibility that a minority of recurrent within-host variants are the result of convergent positive selection. A plausible example is the spike glycoprotein mutation D614G, which has rapidly increased in frequency throughout the world: this appears as a within-host variant in 12 of our samples, and typically with a higher allele frequency than other recurrent within-host variants (median VAF 0.56 vs 0.036, Wilcoxon test P=3.98e-4). Extensive sharing of minority variants across a diverse genomic background suggests caution is needed when using within-host variants for the inference of transmission. In order to investigate the genetic background of our samples we generated a maximum likelihood phylogeny of all consensus genomes produced by the COG-UK consortium by the end of May 2020, including those for the samples on our dataset ( Figure 5A ). This showed that our samples represent a broad range of the SARS-CoV-2 genetic diversity found in the UK at that time, and that the diversity observed among our samples was not primarily driven by geographical location. To explore the relationship between within-host variants and the consensus phylogeny, we identified within-host variants that are shared between samples, as illustrated by links between the tips of the phylogenetic tree in Figure 5B . This confirmed that within-host variants are often shared between samples that are distant on the consensus phylogeny. Both a high level of recombination and a large transmission bottleneck would be required for within-host variants to be maintained across long transmission chains in order to explain the simultaneous preservation of some within-host variants with the fixation of others. While, recent studies have suggested that the bottleneck in SARS-CoV-2 transmission can be on the order of 10 2 -10 3 virions, they also identified substantial overlaps in the fraction of shared minority variants between samples unrelated by close transmission (13, 31) . The sharing of within-host variants between consensus genomes as divergent as 10 SNPs, indicating multiple months of separation between the samples suggests that a more likely explanation is that many of these shared within-host variants are the result of recurrent mutation or co-infection ( Fig 6D) . To further investigate the correspondence between shared within-host variants and transmission, we used the transcluster algorithm to infer the probability distribution of the number of intermediate hosts that separate each pair of samples (32) . This pairwise approach accounts for the serial interval of the virus as well as its evolutionary rate using the difference in the number of SNPs between each pair of consensus genomes. Fig 6B illustrates the relationship between the mean number of shared within-host variants and the probability that the respective infections were a result of direct transmission. A weak correlation with transmission probability was found for shared within-host variants of high allele frequency, but not for those of low allele frequency. Co-infections correlate with the consensus phylogeny consistent with cases of superinfection induced by population structure. It has been proposed that coinfection by different lineages is a significant source of within-host variation in SARS-CoV-2 (3). Co-infection by multiple lineages would result in the allele frequencies of divergent bases to appear as within-host variants and reflect the proportions of either haplotype. Detecting instances of co-infection is important in the context of both transmission and to accurately characterise the within-host mutation rate (9, 33) . To identify possible co-infected samples we compared the site frequency spectrum for each of our samples with all of the consensus genomes in the COG-UK dataset at the end of May 2020. A linear model was used to identify mixtures of two consensus genomes that could explain the allele frequencies of variants within a sample better than a single consensus genome. Additional mixtures were considered if two consensus genomes could explain at least two additional within-host variants over that of a single consensus genome. In this case the additional consensus genome was restricted to have at most one variant that was not found in the sample. Samples identified by the model as being a pos- sible mixture of consensus genomes were then visually inspected to determine whether the putative co-infections were convincing. This resulted in 36 putative co-infected samples, with a representative example shown in Figure 6A . The frequencies of within-host variants of all 36 samples, along with consensus genomes that comprise the putative mixture, are shown in Supplementary Figure 7 . To determine the potential impact of co-infections on transmission inference we excluded all 36 samples and re-ran the pairwise transmission analysis using the transcluster algorithm on the remaining samples. Figure 6C indicates that the remaining within-host variants no longer correlate as well with the probability of direct transmission, suggesting that much of the correlation observed previously was potentially driven by co-infections. It has been suggested that correlations between co-infections and the consensus phylogeny could be driven by the cotransmission of multiple strains (3). However, as the transmission signal in the minority variants in the remaining samples has reduced; a more likely explanation is that coinfection of certain strains is driven by multiple episodes of superinfection (co-infection from two different infection sources). As these samples and those from the Lythgoe study were acquired from hospitalised patients, the correspondence with the consensus phylogeny could then be driven by structured superinfection within hospital COVID-19 wards. The prevalence of infection within hospitals, particularly during the first 'wave' of the COVID-19 pandemic was often considerably higher than in the community with up to 50% of the consultant A&E workforce at one Welsh hospital testing positive in April, 2020 (34) . This contrasts with the finding that approximately 6% of the UK population had been infected with SARS-CoV-2 by the end of June, 2020 (35) . Within hospital transmission was also common in the early stages of the outbreak in Wuhan with 41% of 138 patients were found to have contracted SARS-CoV-2 in hospital (36). While it is not possible to rule out cross-contamination between samples, this contamination would have to be structured to explain the correlations between the inferred co-infections and the consensus phylogeny. Batch effects were carefully controlled for by Lythgoe et al., and as we identified a similar signal using an independent dataset with samples sequenced in replicate, structured superinfection provides a better explanation. In cases of higher prevalence, repeated episodes of superinfection rather than co-transmission could complicate the use of within-host minority variants in transmission inference ( Figure 6D ). We find a considerable amount of genetic diversity within individual SARS-CoV-2 infections that cannot be explained by technical artefacts. This is consistent with the quasispecies population structure typically found in RNA viruses (5, 37), including related betacoranaviruses SARS-CoV-1 and MERS (38, 39) . By analysing the frequency of variants within individual samples, we estimate that each viral genome has an average of 0.72 variants relative to the consensus genome sequence for that sample. Since most of these samples were probably collected more than a week after the time of infection, and since the SARS-CoV-2 genome is known to acquire approximately one mutation every two weeks, these findings are broadly consistent with the hypothesis that within-host variation is largely due to the accumulation of de novo mutations within the host. Further support for this hypothesis comes from our analysis of hosts sampled at multiple timepoints, showing that the number of within-host variants tends to increase during the course of an infection. Increased numbers of within-host variants have also been observed in immunocompromised patients: consistent with the acquisition of de novo mutations within the host (40) . These data show that within-host variants have a similar mutational spectrum to major variants that define betweenhost variation. Both are characterised by clear purifying selection, as would be expected if virions with disadvantageous mutations failed to survive or propagate within the host. Strikingly, the mutational spectrum of SARS-CoV-2 exhibits a near complete asymmetry between the plus and minus strands, and is dominated by C>U and G>U mutations. This seems consistent with RNA-damage or RNA-editing of the plus strand. Whilst mutagenesis by APOBEC enzymes could play a role, the spectrum we observe is very different to that seen in human cells. Instead, direct damage of cytosine and guanine bases could also be consistent with the observed spectrum. Many of the within-host variants that we have identified are shared between infected individuals located on distant branches of the consensus phylogeny, and this is congruent with previous reports that the SARS-CoV-2 consensus phylogeny has many homoplasies that also appear as within-host variants (3) . It appears likely that this is largely due to recurrent mutation although co-infection between lineages could also be a relevant factor in high transmission settings. The vast majority of our samples appear to comprise a single lineage but we find evidence of putative co-infection by multiple lineages in approximately 3-4% of samples, which is likely an underestimate of the extent of co-infection in our dataset. In other viral and bacterial diseases, within-host variants provide a valuable source of information for the inference of transmission chains (33, (41) (42) (43) and have been shown to improve the accuracy of inferences based on consensus genomes (44) (45) (46) (47) . In the case of SARS-CoV-2, Lythgoe and colleagues have reported a geographical structure to the sharing of within-host variants, and have proposed that this might be explained by localised episodes of superinfection (where an individual acquires infection from multiple sources) and subsequent co-transmission of multiple lineages and their recombinants through the local host population (3). We find some evidence of a correlation between sharing of withinhost variants and the probability of direct transmission, a signal that we find is partially driven by co-infection, but the current dataset lacks the epidemiological resolution to evaluate the utility of this in reconstructing transmission chains, which might depend on the prevalence of infection and other local circumstances such as superinfection within COVID-19 hospital wards. In summary, these data show that much of the genetic diversity of the SARS-CoV-2 viral population resides within individual hosts. Within-host diversity is generated by the accumulation of de novo mutations during the course of infection but can also result from co-infection with different lineages. Analysis of within-host variation could potentially be used to improve the inference of transmission chains, but this approach requires caution because of the confounding effects of recurrent mutation and unrelated episodes of superinfection. More detailed studies are required to evaluate the transmission bottlenecks that govern the propagation of within-host variants from host to host, and to examine the patterns of within-host variation associated with different epidemiological circumstances. Sample selection and ethics. The 1,181 samples were taken as a random subset from a larger prospective study into SARS-CoV-2 infections at Cambridge University Hospitals National Health Service Foundation Trust (CUH; Cambridge, UK), a secondary care provider and tertiary referral centre in the East of England (4, 48) . This study was done as part of surveillance for COVID-19 under the auspices of Section 251 of the National Health Service Act 2006. It therefore did not require individual patient consent or ethical approval. The COVID-19 Genomics UK (COG-UK) study protocol was approved by the Public Health England Research Ethics Governance Group. Variant calling. Variant calling was performed using Shear-waterML, which is available as part of the deepSNV R package (11, 12) . In order to create a base-specific error model for the SARS-CoV-2 genome, we randomly selected sequencing data from 100 samples (50 from each set of duplicates) from the 468 samples that met the following criteria: (1) ρ value from beta-binomial model ≤ 0.02; (2) proportion of genome with at least 500× coverage ≥ 0.9 for both duplicates; (3) absolute difference in median coverage between replicates ≤ 20, 000; and (4) only one sample sequenced from a given donor. For each sample, non-reference sites with VAF ≥ 0.01 were set as uninformative (i.e. depth for all base types were set to 0) in order to enable calling of recurrent, high VAF mutations. However, two sites (11074 and 25202) were effectively excluded from variant calling as they were set as uninformative for all samples in the normal panel. Variants were called separately for each set of duplicates. The initial ShearwaterML calls were filtered using the following criteria: (1) total depth at variant site ge100×; and (2) Benjamini-Hochberg False Discovery Rate q-value ≤ 0.05 in one duplicate and unadjusted p-value ≤ 0.01 in the other duplicate. The adjustment of p-values was performed by considering the five mutation types (three non-reference bases, insertions and deletions) at all sites with ≥ 100× coverage in the 1,181 samples from a given duplicate set. Variants at consecutive sites were merged into single events if the difference in VAFs ≤ 0.05. Beta-binomial modelling of replicate samples. To quantify the level of discordance in the allele frequencies between technical replicates, for each pair replicates we first identified a set of variable sites, which are expected to contain both artefacts and within-host mutations. All non-reference variable sites with coverage in both replicates > 100x and mean VAF from both replicates higher than 1% and lower than 90% were considered for the beta-binomial modelling. In Illumina sequencing protocols where adapters are ligated before amplification and PCR duplicates can be removed computationally, VAFs may be expected to show binomial variation between technical replicates. However, amplicon sequencing can lead to preferential amplification of some molecules leading to higher variation in VAFs. We quantified the extent of variation above binomial sampling in the VAFs of variable sites between replicates by fitting a beta-binomial model to the allele counts of variable sites, obtaining a maximum likelihood estimate of the ρ (overdispersion) parameter for each pair of replicate samples. Let x 1,j and x 2,j be the alternative (non-reference) allele counts of variable site j in replicates 1 and 2 of a given sample, and n 1,j and n 2,j be the local coverages at the site. We can calculate an approximate likelihood for the beta-binomial model using the following equation, where P depicts the beta-binomial density function: A maximum likelihood estimate for the overdispersion parameter was obtained by grid search. The overdispersion pa-rameter, measuring the level of discordance in allele frequencies between the replicates, was found to correlate strongly with the Ct value of the sample. Mutational spectrum. The mutational spectra shown in Figure 1 were generated assuming that the allele in the reference sequence (SARS-CoV-2 isolate Wuhan-Hu-1, MN908947.3) represents the ancestral allele. The normalised mutation rates (r) for each of the 192 possible changes (j) in a trinucleotide context were calculated as: Where n j is the total number of mutations observed for a trinucleotide change j, and L j is the total number of times that the corresponding trinucleotide is present in the reference genome (MN908947.3). When observing the same mutation in multiple samples, it is not always straightforward to determine the number of independent mutational events that this represents. For major variants, this can be better estimated using a phylogeny, although recurrent hotspots can cause errors in the phylogenetic reconstruction. For simplicity, mutations observed across samples as major variants were counted only once for the estimation of n j . For within-host variants, a more relaxed approach was used since, based on our results, multiple instances of the same within-host variant across samples are more likely to represent independent events. Within-host variants observed in more than 5 samples were only counted 5 times for the calculation of n j , to avoid a small number of highly-recurrent hotspots from dominating the trinucleotide spectrum. Selection analysis. Analyses of selection were carried out using the dNdScv package (28) . In contrast to traditional implementations of dN/dS developed for divergent sequences, which rely on Markov-chain codon substitution models (49) , dNdScv was developed for comparisons of closely related genomes. Under low divergence rates, such as those in the densely sampled SARS-CoV-2 phylogeny, observed changes typically represent individual mutational events, which can be modelled using a Poisson framework instead of Markovchain substitution models. This enables the use of more complex and realistic substitution models, including context dependence, strand asymmetry, non-equilibrium sequence composition and separate estimation of dN/dS ratios for missense and nonsense mutations. These changes can be important to avoid false signals of negative or positive selection under neutrality when using simplistic substitution models (28) . Here we used the default substitution model in dNd-Scv, which uses 192 rate parameters to model all possible mutations in both strands separately in a trinucleotide context, as well as two ω parameters to estimate dN/dS ratios for missense and nonsense mutations separately. dNdScv files and code needed to generate dN/dS ratios from SARS-CoV-2 data are available as Supplementary Code. dN/dS ratios on polymorphism data need to be interpreted with caution. This is both because dN/dS ratios can be D R A F T time dependent (50) , providing weaker signals of selection for more recent changes, and because dN/dS ratios can behave non-monotonically with respect to selection coefficients under idealised conditions of free recombination when using nucleotide diversity within a population (51) . The latter can result from strong positive selection causing a reduction in the effective population size (and in nucleotide diversity) of non-synonymous sites under free recombination. However, the potential loss of monotonicity should not be a concern in our analyses of SARS-CoV-2 data. This is both because free recombination between adjacent sites is unrealistic for SARS-CoV-2 and because collections of mutations were identified by comparison of derived sequences to an ancestral reference genome, instead of by comparison of two derived sequences at any one time. phylogeny construction. Consensus genomes for each replicate were generated using the ARCTIC SARS-CoV-2 bioinformatics pipeline. A multiple sequence alignment was generated using MAFFTv7.464 and any sites that were discordant between replicates were set to be ambiguous (52) . Sites that have previously been identified to create difficulties in generating phylogenies were masked using bedtools v2.29.2 using the VCF described in De Maio et al, 2020 (30, 53) . Fasttree v2.1.11 was used to generate a maximum likelihood phylogeny (54) . Transmission model. To investigate transmission, samples were only considered if both replicates produced high quality consensus genomes. When multiple samples from the same host were available the earliest sample was used. Pairwise SNP distances were generated between the consensus genomes using pairsnp v0.2.0 (55) . The distribution of the underlying number of intermediate transmission events between each pair of samples was then inferred using an implementation of the transcluster algorithm (32, 56) . The serial interval and evolutionary rate were set to 5 days and 1e-3 substitutions/site/year (26, 57) . Potential mixed infections were identified using a linear model by testing whether the allele frequencies in a sample could be better explained by the inclusion of an additional consensus genome from the COG-UK dataset of the 29th May 2020. Additional samples mixtures were considered if the addition of a COG-UK consensus genome could explain at least 2 iSNVs and have at most 1 variant that was not found in the alleles of the sample. This identified 54 putative mixtures which were then screened manually to obtain 36 potentially mixed samples. The code used to run this analysis is available in the supplementary materials. ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ATA ATC ATG ATT CTA CTC CTG CTT GTA GTC GTG GTT TTA TTC TTG TTT ATA ATC ATG ATT CTA CTC CTG CTT GTA GTC GTG GTT TTA TTC TTG TTT ATA ATC ATG ATT CTA CTC CTG CTT GTA GTC GTG GTT TTA TTC TTG T>A T>C T>G ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ATA ATC ATG ATT CTA CTC CTG CTT GTA GTC GTG GTT TTA TTC TTG TTT ATA ATC ATG ATT CTA CTC CTG CTT GTA GTC GTG GTT TTA TTC TTG TTT ATA ATC ATG ATT CTA CTC CTG CTT GTA GTC GTG GTT TTA TTC TTG World Health Organization. Weekly update on COVID-19 An integrated national scale SARS-CoV-2 genomic surveillance network. The Lancet Microbe Shared SARS-CoV-2 diversity suggests localised transmission of minority variants Rapid implementation of SARS-CoV-2 sequencing to investigate cases of healthcare associated COVID-19: a prospective genomic surveillance study Viral quasispecies Rapid evolution of RNA genomes Hidden genomic diversity of SARS-CoV-2: implications for qRT-PCR diagnostics and transmission Franziska Michor, Christoph Bock, and Andreas Bergthaler. Mutational dynamics and transmission properties of SARS-CoV-2 superspreading events in austria Human cytomegalovirus haplotype reconstruction reveals high diversity due to superinfection and evidence of within-host recombination COVID-19 ARTIC v3 illumina library construction and sequencing protocol v4 Subclonal variant calling with multiple samples and prior knowledge Tumor evolution. high burden and pervasive positive selection of somatic mutations in normal human skin Genomic epidemiology of superspreading events in austria reveals mutational dynamics and transmission properties of SARS-CoV-2 A contemporary view of coronavirus transcription Coronavirus biology and replication: implications for SARS-CoV-2 The strand-biased transcription of SARS-CoV-2 and unbalanced inhibition by remdesivir. bioRxiv A molecular pore spans the double membrane of the coronavirus replication organelle APOBEC3A and APOBEC3B preferentially deaminate the lagging strand template during DNA replication Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2 Genomic diversity of severe acute respiratory Syndrome-Coronavirus 2 The repertoire of mutational signatures in human cancer Uracil in DNA-occurrence, consequences and repair Mechanisms underlying mutational signatures in human cancers Katrin Zwirglmaier, Christian Drosten, and Clemens Wendtner. Virological assessment of hospitalized patients with COVID-2019 SARS-CoV-2 virus culture and subgenomic RNA for respiratory specimens from patients with mild coronavirus disease Coast-to-Coast spread of SARS-CoV-2 during the early epidemic in the united states Signatures of mutational processes in human cancer Universal patterns of selection in cancer and somatic tissues Mutational signatures are critical for proper estimation of purifying selection pressures in cancer somatic mutation data when using the dN/dS metric Issues with SARS-CoV-2 sequencing data Superspreading events without superspreaders: Using high attack rate events to estimate n0 for airborne transmission of COVID-19 Beyond the SNP threshold: Identifying outbreak clusters using inferred transmissions The Maela Pneumococcal Collaboration, and The BEEHIVE Collaboration. PHY-LOSCANNER: Inferring transmission from within-and Between-Host pathogen genetic diversity COVID-19: the case for health-care worker screening to prevent hospital transmission Antibody prevalence for SARS-CoV-2 in england following first peak of the pandemic Clinical characteristics of 138 hospitalized patients with 2019 novel Coronavirus-Infected pneumonia in wuhan, china Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population Spike gene deletion quasispecies in serum of patient with acute MERS-CoV infection: MERS-CoV spike gene deletion SARS-associated coronavirus quasispecies in individual patients SARS-CoV-2 genomic and quasispecies analyses in cancer patients reveal relaxed intrahost virus evolution Bayesian reconstruction of transmission within outbreaks using genomic variants Shared genomic variants: Identification of transmission routes using pathogen Deep-Sequence data Rapid outbreak sequencing of ebola virus in sierra leone identifies transmission chains linked to sporadic cases Within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data Epidemic reconstruction in a phylogenetics framework: Transmission trees as partitions of the node set SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent When are pathogen genome sequences informative of transmission events? Dominic Kwiatowski, Ian Goodfellow, M Estee Torok, and COVID-19 Genomics Consortium UK. COVID-19 infection dynamics in care homes in the east of england: a retrospective genomic epidemiology study. medRxiv A codon-based model of nucleotide substitution for protein-coding DNA sequences Comparisons of dN/dS are time dependent for closely related bacterial genomes The population genetics of dN/dS MAFFT: a novel method for rapid multiple sequence alignment based on fast fourier transform BEDTools: a flexible suite of utilities for comparing genomic features FastTree 2 -approximately Maximum-Likelihood trees for large alignments Evolving epidemiology and transmission dynamics of coronavirus disease 2019 outside hubei province, china: a descriptive and modelling study CAMB−79FBA (B.1) − CAMB−79CB3 (B.1) CAMB−79837 (B.1.5) − PHWC−35882 (NA) A) − CAMB−7765D (B.1.11) CAMB−77990 (B.1) − BRIS−125D1C (B.1) CAMB−76894 (B.1) − CAMB−768FE (B.1) CAMB−768D0 (B.1.13) − CAMB−7A5E4 (B.1.5) CAMB−76AE9 (B.1.5) − CAMB−7A5E4 (B.1.5) CAMB−76409 (B.1) − CAMB−7ADBF (B.1) CAMB−76472 (B.1) − CAMB−78BD5 (B.1) CAMB−76515 (B.1) − CAMB−71F76 (B.1) CAMB−71D03 (B.1) − CAMB−79697 (B.1) CAMB−71F0D (B.1) − CAMB−7DAF1 (B.2.5)