key: cord-0902385-ec3ewtqo authors: Ignatieva, Anastasia; Hein, Jotun; Jenkins, Paul A. title: Ongoing recombination in SARS-CoV-2 revealed through genealogical reconstruction date: 2021-06-16 journal: bioRxiv DOI: 10.1101/2021.01.21.427579 sha: 195995eb0724b8f3b2e06672436633ad988d34d2 doc_id: 902385 cord_uid: ec3ewtqo The evolutionary process of genetic recombination has the potential to rapidly change the properties of a viral pathogen, and its presence is a crucial factor to consider in the development of treatments and vaccines. It can also significantly affect the results of phylogenetic analyses and the inference of evolutionary rates. The detection of recombination from samples of sequencing data is a very challenging problem, and is further complicated for SARS-CoV-2 by its relatively slow accumulation of genetic diversity. The extent to which recombination is ongoing for SARS-CoV-2 is not yet resolved. To address this, we use a parsimony-based method to reconstruct possible genealogical histories for samples of SARS-CoV-2 sequences, which enables us to pinpoint specific recombination events that could have generated the data. We propose a statistical framework for disentangling the effects of recurrent mutation from recombination in the history of a sample, and hence provide a way of estimating the probability that ongoing recombination is present. We apply this to samples of sequencing data collected in England and South Africa, and find evidence of ongoing recombination. Ongoing mutation of the SARS-CoV-2 virus has received significant scientific and media attention since the start of the pandemic. The process of viral recombination has received far less coverage, but has the potential to have a drastic impact on the evolution of virulence, transmissibility, and evasion of host immunity (Simon-Loriere & Holmes, 2011). Recombination occurs when host cells are co-infected with different strains of the same virus, and during replication the genomes are reshuffled and combined before being packaged and released as new offspring virions, now potentially possessing very different pathogenic properties. This makes the presence of recombination a crucial factor to consider when developing vaccines and treatments. While the role of recombination between different coronaviruses in the emergence of SARS-CoV-2 has been widely studied, understanding its potential for ongoing recombination within human hosts has proved difficult. The detection of ongoing recombination from a sample of genetic data is, in general, a very challenging problem. Only a fraction of recombination events significantly change the shape of a genealogy, and even then, mutations must occur on the correct branches of the genealogy in order to create patterns that are detectable in the data (Hein et al., 2004, Section 5.11) . In evolutionary terms, a relatively short time period has passed since the start of the pandemic, so typical SARS-CoV-2 sequences differ only by a small number of mutations, meaning that recombination events are likely to be undetectable or leave only faint traces. Moreover, the effects of recombination can be indistinguishable from those of recurrent mutation (McVean et al., 2002) , where mutations have occurred at the same site multiple times in the history of the sample. Coronaviruses are known to have relatively high recombination rates (Su et al., 2016) , and cell culture studies indicate that this holds true for SARS-CoV-2 (Gribble et al., 2021) . This suggests that ongoing intra-host recombination since the start of the pandemic should be commonplace, but detection efforts are thwarted by the slow accumulation of genetic diversity. Early evidence of ongoing recombination in SARS-CoV-2 was presented by Yi (2020), who identified the presence of loops in reconstructed phylogenetic networks, which can arise as a consequence of recombination. A number of more recent reports have utilised methods based on classifying sequences into clades, and searching for those that appear to carry a mix of mutations characteristic to more than one clade. VanInsberghe et al. Richard et al., 2020) . In general, a relatively small number of putative recombinant sequences have been identified to date, and there is a lack of compelling evidence for widespread recombination in SARS-CoV-2. Given the aforementioned causes for studies to be underpowered, the overall extent and importance of ongoing recombination for SARS-CoV-2 remains not fully resolved. Phylogenetic analysis of SARS-CoV-2 data largely assumes the absence of recombination. Recombination can significantly influence the accuracy of phylogenetic inference (Posada & Crandall, 2002) , distorting the branch lengths of inferred trees and making mutation rate heterogeneity appear stronger (Schierup & Hein, 2000) . Moreover, when analysing data at the level of consensus sequences, the genealogy of a sample is related to the transmission network of the disease, with splits in the genealogy relating to the transmission of the virus between hosts. Models used for constructing genealogies and inferring evolutionary rates for this type of data cannot fully incorporate potentially important factors, such as geographical structure, patterns of social mixing, travel restrictions, and other nonpharmaceutical interventions, without making inference intractable. Relying on standard tree-based models can easily lead to biased estimates, with the extent of the error due to model misspecification being very difficult to quantify. In this article, we use KwARG (Ignatieva et al., 2021) , a parsimony-based method for reconstructing possible genealogical histories, to detect and examine crossover recombination events in samples of viral consensus sequences. This approach provides a concrete way of describing their genealogical relationships, sidestepping the challenges presented by discrepancies in clade assignment, enabling the detection of intra-clade recombination, avoiding the need to specify a particular model of evolution, and allowing for the explicit identification of possible recombination events in the history of a sample. Our method naturally handles both recombination and recurrent mutation, identifying a range of possible explicit genealogical histories for the dataset with varying proportions of both events types. Rather than using summary statistics calculated from the data, or focussing only on patterns of cladedefining SNPs, our method utilises all of the information contained in the patterns of incompatibilities observed in a sample, allowing for powerful detection and identification of possible recombinants. Moreover, we provide a nonparametric framework for evaluating the probability of a given number of recurrent mutations, thus quantifying how many recombinations are likely to have occurred in the history of a dataset. This allows for a more thorough and statistically principled assessment of the extent to which ongoing recombination is occurring. We investigate the presence of ongoing recombination in SARS-CoV-2 using publicly available data from GISAID , collected between November 2020 and February 2021. Using data from South Africa, we demonstrate that our method can detect recombination both when the sample contains sequences from multiple distinct lineages ('inter-clade'), as well as all from the same lineage ('intra-clade'). Further, we show that our method can accurately detect consensus sequences carrying patterns of mutations that are consistent with recombination, flagging these sequences for further investigation -and we demonstrate, using data from England, that it can identify both sequences arising as a result of sequencing errors due to sample contamination, aiding in identifying quality control issues, as well as sequences likely to be true recombinants. We validate our method using extensive simulation studies, and through application to MERS-CoV data, for which we find evidence of recombination, in agreement with previous studies. 2.1. SARS-CoV-2 data. Sequences were downloaded from GISAID, and aligned as described in SI Appendix, Section S1.1. Masking was applied to sites at the endpoint regions of the genomes, any multi-allelic sites, regions with many missing nucleotides in multiple sequences, and sites identified by De Maio et al. (2020) as being highly homoplasic or prone to sequencing errors. Strict quality criteria were applied, as detailed in SI Appendix, Section S1.2, to remove any sequences with a large number of ambiguous nucleotides, multiple non-ACTG characters, excessive gaps, and groups of clustered SNPs; additionally, sites identified by van Dorp et al. (2020a) as being prone to recurrent mutation were masked. These measures were aimed at reducing the possibility of including poor quality or contaminated sequences in the analysed samples, and also masking sites that are known to be highly homoplasic (either due to recurring sequencing errors, or due to the effects of selection). Four samples were analysed: from South Africa, collected in (i) November 2020 (50 sequences, with 25 from lineage B.1.351, and 25 from other lineages) and (ii) February 2021 (38 sequences, all from lineage B.1.351), and from England, collected in (iii) November 2020 (80 sequences, with 40 sequences from lineage B.1.1.7 and 40 from other lineages within GISAID clade GR) and (iv) December 2020 -January 2021 (40 sequences within GISAID clade GR). Details of sample selection and processing are given in SI Appendix, Sections S1.3-S1.6. Overview of methods. Our method consists of two main steps. Firstly, using KwARG, plausible genealogical histories are reconstructed for each sample, with varying proportions of posited recombination and recurrent mutations events. Then, using simulation, we approximate the distribution of the number of recurrent mutations that might be observed in a dataset of the same size as each sample. We use this to establish which of the identified genealogical histories is more plausible for the data at hand, and thus whether the presence of recombination events in the history of the given sample is likely. This can be framed in the language of statistical hypothesis testing. The 'null hypothesis' is the absence of recombination. The test statistic T is the number of recurrent mutations in the history of the dataset; the null distribution of T is approximated through simulation. The observed value T obs is the minimal number of recurrent mutations required to explain the dataset in the absence of recombination, as estimated by KwARG. The 'p-value' is the probability of observing a number of recurrent mutations equal to or greater than T obs . Small p-values allow us to reject the null hypothesis, providing evidence that recombination has occurred. The reconstructed genealogies then allow for the detailed examination of possible recombination events in the history of the sampled sequences. We emphasise that we make very conservative assumptions throughout, both in processing the data and in estimating the distribution of the number of recurrent mutations. Moreover, the number of recurrent mutations required to explain a given dataset computed by KwARG is (or is close to) a lower bound on the actual number of such events, and is likely to be an underestimate, making the reported p-values larger (more stringent). Reconstruction of genealogies. The first step in our approach is to use a parsimony-based method to reconstruct possible genealogical histories for the given datasets. 2.3.1. Incompatibilities in the data. Suppose that each site of the genome can mutate between exactly two possible states (thus excluding the possibility of triallelic sites, which we have masked from the data). Then the allele at each site can be denoted 0 or 1. If the commonly used infinite sites assumption is applied, at most one mutation can affect each site of the genome. The four gamete test (Hudson & Kaplan, 1985) can then detect the presence of recombination: if all four of the configurations 00, 01, 10, 11 are found in any two columns, then the data could not have been generated through replication and mutation alone, and at least one recombination event must have occurred between the two corresponding sites; the sites are then termed incompatible. If the infinite sites assumption is violated, the four gamete test no longer necessarily indicates the presence of recombination, as the incompatibilities could instead have been generated through recurrent mutation (McVean et al., 2002). Ancestral recombination graphs (ARGs). All of the viral particles now in circulation had a common ancestor at the time of emergence of the virus, so sequences sampled at the present time can be united by a network of evolution going back to this shared ancestor through shared predecessors, termed the ancestral recombination graph (ARG) (Griffiths & Marjoram, 1997) . As the sample consists of consensus sequences (at the level of one sequence per host), an edge of this network represents a viral lineage, possibly passing through multiple hosts before being sequenced at the present. An example of an ARG topology can be seen in Figure 4 . Mutations are represented as points on the edges, labelled by the sites they affect. Considering the graph backwards in time (from the bottom up), the point at which two edges merge represents the time at which some sequences in the data coalesced, or have found a common ancestor. A point at which an edge splits into two corresponds to a recombination -the parts of the genome to the left and to the right of the breakpoint (whose site number is labelled inside the blue recombination node) are inherited from two different parent particles. The network thus fully encodes the evolutionary events in the history of a sample. A sample of genetic sequences may have many possible histories, with many different corresponding ARGs. The parsimony approach to reconstructing ARGs given a sample of genetic data focuses on minimising the number of recombination and/or recurrent mutation events. This does not necessarily produce the most biologically plausible histories, but it does provide a lower bound on the number of events that must have occurred in the evolutionary pathway generating the sample. Thus, recombination can be detected in the history of a sample by considering whether the most plausible parsimonious solutions contain at least one recombination node. Crucially, the parsimony approach does not require the assumption of a particular generative model for the data (such as the coalescent with exponential growth) beyond specifying the types of events that can occur. While this means that mutation and recombination rates cannot be inferred, it allows us to sidestep the need to specify a detailed model of population dynamics, which is particularly challenging for SARS-CoV-2 data. A parsimony-based approach is more appropriate when our focus is on interrogating the hypothesis that recombination is present at all. It also allows for the explicit reconstruction of possible events in the history of a sample, and thus allows us to identify recombinant sequences and uncover patterns consistent with the effects of sequencing errors. KwARG. KwARG (Ignatieva et al., 2021) is a program implementing a parsimony-based heuristic algorithm for reconstructing plausible ARGs for a given dataset. KwARG identifies 'recombination only' solutions (all incompatibilities are resolved through recombination events) and 'recurrent mutation only' solutions (all incompatibilities are resolved through additional mutation events), as well as interpolating between these two extremes and outputting solutions with a combination of both event types. KwARG allows for missing data and disregards insertions and deletions (we have deleted insertions from the alignment and treat deletions as missing data). KwARG seeks to minimise the number of posited recombination and recurrent mutation events in each solution, and the proportions of the two event types can be controlled by specifying 'cost' parameters. KwARG distinguishes between recurrent mutations that occur on the internal branches of the ARG from those can be placed on the terminal branches, which affect only one sequence in the input dataset, so can be examined separately for indications that they arose due to errors in the sequencing process. KwARG was run on the data samples as detailed in SI Appendix, Section S3; an overview of the identified solutions is given in Tables 1a-1d. 0.00 2 · 10 −3 2 10 0.00 < 1 · 10 −6 1 12 0.00 < 1 · 10 −6 0 16 0.00 < 1 · 10 −6 Table 1 . Summary of solutions identified by KwARG for each sample, and the probability of observing the corresponding number of recurrent mutations. First column: number of recombinations. Second column: number of recurrent mutations. Third column: probability of observing a number of recurrent mutations equal to that in the second column. Fourth column: corresponding p-values (probability of observing a number of recurrent mutations equal to or greater than that in the second column). Evaluation of solutions. The next step in our approach is to determine which of the solutions identified by KwARG is more likely, by calculating the probability of observing the given number of recurrent mutations. To avoid making model-based assumptions on the genealogy of the sample, we use a nonparametric method inspired by the homoplasy test of Maynard Smith & Smith (1998) . The homoplasy test estimates the probability of observing the minimal number of recurrent mutations required to generate the sample in the absence of recombination, i.e. if the shape of the genealogy is constrained to be a tree. If this probability is very small then it provides evidence for the presence of recombination. The method is particularly powerful when the level of divergence between sequences is very low, as is the case with SARS-CoV-2 data, although it appears prone to false positives in the presence of severe mutation rate heterogeneity along the genome (Posada & Crandall, 2001) . We calculate an empirical estimate P of mutation density along the genome from SARS-CoV-2 data, which does not suggest the presence of extreme heterogeneity, and then use this estimate to simulate the distribution of the number of recurrent mutations that are observed in a sample. The i-th entry of the vector P , for i ∈ {1, . . . , 29 903}, gives an estimated probability that when a mutation occurs, it affects the i-th site of the genome. Details of our method for estimating P are presented in SI Appendix, Section S4. Briefly, this estimate is calculated by examining the locations of sites that have undergone at least one mutation (segregating sites) using GISAID data collected in February 2021. If the mutation rate were constant along the genome, we would expect segregating sites to be spread uniformly throughout the genome; uneven clustering of the mutations gives an indication of mutation rate heterogeneity. We use a nonparametric method (wavelet decomposition) to estimate P from the observed positions of segregating sites, taking into account the dependence of the mutation rate on the base type of the nucleotide undergoing mutation, which is significant for SARS-CoV-2 (Simmonds, 2020; Koyama et al., 2020). The resulting estimate is shown in Figure 1 . The estimate of P is then used to approximate the distribution of the number of recurrent mutations observed in a sample, using a simulation approach. We simulate the process of mutations falling along the genome until the simulated number of segregating sites matches that observed in the sample; the vector P controls where on the genome each mutation falls. The number of recurrent mutations (instances where mutations fall on the same site multiple times) is recorded. This procedure is repeated for 1 000 000 iterations and a histogram of the results is constructed. The resulting probabilities and corresponding p-values are shown in the third and fourth columns of Tables 1a-1d. 3.1.1. False positives. The accuracy of the presented method depends on an assumption that there are no highly homoplasic sites (arising either due to selection or repeated sequencing errors) that have not been masked. If this assumption were violated, the estimate P would be missing 'spikes' of high probability at the corresponding positions, biasing the simulated null distribution to underestimate the number of recurrent mutations, and potentially leading to false positive results. We investigated the validity of this assumption through simulation as described in SI Appendix, Section S4.3.2, by inflating the mutation probability of a subset of 0 to 200 sites in the vector P by a factor H, simulating data with the resulting mutation rate map in the absence of recombination with parameters that appear reasonable for SARS-CoV-2, and checking whether our method would (incorrectly) reject the null hypothesis. The results are presented in the left panel of Figure 2 . False positives were seen in only 0.5% of cases when there are no highly homoplasic sites, demonstrating that our method conservatively overestimates the computed p-values. The proportion of false positives only increases significantly when a large number of extremely homoplasic sites is present, showing that our method is reasonably robust to violations of this assumption. Particularly as we apply a conservative strategy in masking sites known to be homoplasic, this shows that our method is unlikely to falsely indicate the presence of recombination. 3.1.2. Detectable recombination rate. The power of our test in detecting the presence of recombination was investigated for a range of recombination rates ρ, by simulating datasets as described in SI Appendix, Section S4.3.3, and recording how often the null hypothesis of no recombination could be rejected (with p < 0.05). The results are shown in the right panel of Figure 2 , demonstrating that this occurred in 4.5% of cases for ρ = 1 · 10 −7 per site per generation, rising to 99.5% of cases for ρ = 1 · 10 −5 . The simulations were performed using parameters that appear reasonable for SARS-CoV-2; the results suggest that our method is sufficiently powerful for detecting recombination if the recombination rate is higher than around ρ = 1 · 10 −6 ≈ 4 · 10 −5 per site per year. 3.2. Identification of recombinant sequences. All sequences collected in England in December 2020 -January 2021, labelled as belonging to clade GR in GISAID, were downloaded and processed as described in SI Appendix, Section S1.6. The resulting sample comprises 40 sequences with 276 variable sites. An illustration of the sample is provided in SI Appendix, Figure S4 . Choosing a solution with no recombinations, the sites of fourteen recurrent mutations identified by KwARG are highlighted with red (yellow) crosses, where the recurrent mutations fall on the terminal (internal) branches of the ARG. The sequencing protocol used by the COVID-19 Genomics UK Consortium, the submitters of the data, generates short amplicons of under 400 bp in length, and none of the identified sites of recurrent mutations fall into the same amplicon region, making it less likely that the results are due to sample contamination or other sequencing artifacts. The probability of observing the required T obs = 14 or more recurrent mutations is p = 1 · 10 −6 , which strongly indicates the presence of recombination. Considering the results in Table 1c , three recurrent mutations can have the same effect as six of the identified recombination events (compare row (R, RM ) = (10, 0) with (R, RM ) = (4, 3)), suggesting that recurrent mutation offers a more parsimonious explanation for at least part of the patterns seen in the data. One of these recurrent mutations consistently occurs at site 22 227; the other two can be placed either at the same site 9 693, or at sites 9 693 and 12 067. The probability of observing five or fewer recurrent mutations is 0.97, which suggests that, with high probability, at least two recombination have occurred in the history of the sample. An example of an ARG with two recombination events is shown in Figure 3 . It is striking that eight of the recurrent mutations seen in Figure S4 can be placed in the same sequence E39. Indeed, Figure 3 shows that the corresponding incompatibilities in the data can be resolved by just one recombination event between sequence E40 and a sequence from lineage B.1.1.7; Figure 3 . Example of an ARG for the England (January) dataset. Recombination nodes are shown in blue, labelled with the recombination breakpoint, with the offspring sequence inheriting part of the genome to the left (right) of the breakpoint from the parent labelled "P" ("S"). Recurrent mutations are prefixed with an asterisk. Edge carrying the characteristic mutations of lineage B.1.1.7 is highlighted in red; nodes corresponding to sequences from lineage B.1.1.7 are coloured purple. For ease of viewing, some parts of the ARG have been collapsed into nodes labelled "E...". Edges are labelled by positions of mutations (some mutated sites are not explicitly labelled and are denoted by a dot instead). the corresponding recombination node is shown in bold. The sequence E39 has previously been identified as a possible recombinant by Jackson et al. (2021) , demonstrating that our method can clearly highlight mosaic sequences in addition to quantifying the probability that recombination has occurred in the history of the dataset. Detection of intra-clade recombination. All sequences collected in South Africa in February 2021 were downloaded and processed as described in SI Appendix, Section S1.4. The resulting sample comprises 38 sequences with 151 variable sites, all from the same lineage B.1.351. Initial examination of the solutions identified by KwARG show that at least 8 recurrent mutations are required to construct a valid ARG for this sample in the absence of recombination. However, it was noted that three of these recurrent mutations occur at the same site 28 254. This may imply that the site is highly mutable, which could be due to repeated sequencing errors, or as a consequence of selection. We note that this demonstrates the usefulness of our approach in identifying potentially highly homoplasic sites. This position was masked from the sample before re-running the analysis. The probability of observing the re-calculated value of T obs = 5 or more recurrent mutations is p = 7 · 10 −4 , strongly suggesting the presence of recombination. The probability of observing two or fewer recurrent mutations is 0.97, which indicates that with high probability, at least three recombination events have occurred in the history of the dataset. Detection of inter-clade recombination. All sequences collected in South Africa in November 2020 were downloaded and processed as described in SI Appendix, Section S1.3, to create a sample of 50 sequences with 207 variable sites, with 25 belonging to lineage B.1.351 (labelled SAN1-SAN25), and 25 to other lineages (labelled SAO1-SAO25). An initial run of KwARG demonstrated that, notably, one recurrent mutation occurs at site 28 254, further suggesting that this site is excessively prone to recurrent mutation. This site was therefore masked before re-running the analysis. An illustration of the sample is provided in SI Appendix, Figure S5 . The sites of nine recurrent mutations identified by KwARG are highlighted with red crosses (choosing a solution with no recombinations, and where the recurrent mutations fall on the terminal branches of the ARG). The probability of observing the required T obs = 9 or more recurrent mutations is p = 7 · 10 −6 , strongly suggesting the presence of recombination. The probability of observing three or fewer recurrent mutations is 0.96, which indicates that, with high probability, at least four recombination events have occurred in the history of the dataset. Indeed, Table 1 shows that three recurrent mutations can remove the necessity of six recombination events, suggesting that recurrent mutation offers a more parsimonious explanation than recombination for the remaining incompatibilities in the data. Examination of the KwARG solutions shows that these recurrent mutations consistently occur at sites 4 093, 11 230, and 25 273. An ARG with recurrent mutations at these three sites is shown in Figure 4 ; edges carrying the characteristic mutations of lineage B.1.351 are highlighted in red. The sequences SAO21 and SAO22 carry three and two of the identified nine recurrent mutations, respectively, when recombination is prohibited in reconstructing the genealogy. Both of these sequences carry some of the mutations characteristic of lineage B.1.351; this is demonstrated in Figure 5 , where the two sequences are compared to two other typical sequences from lineage B.1.351. Examination of the KwARG solutions shows that a recombination in Sequence SAO21 just after site 22 812 has the same effect as the recurrent mutations at sites 22 813 and 23 012, and a recombination in Sequence SAO22 just after site 23 011 has the same effect as the recurrent mutations at sites 23 012 and 23 063. This suggests that the patterns of incompatibilities observed in these two sequences are consistent with recombination; a possible sequence of recombination events generating these sequences can be seen in the ARG in 3.5. Identification of sequencing errors due to cross-contamination. All sequences labelled as GISAID clade GR, collected in England in November 2020, were aligned, masked, and processed as detailed in SI Appendix, Section S1.5. The quality criteria detailed in SI Appendix, Section S1.2 were not applied in this case. The resulting sample comprises 80 sequences with 363 variable sites, 40 of which belong to lineage B.1.1.7 (labelled EN1-EN40) and 40 to other lineages (labelled EO1-EO40). The results showed that in the absence of recombination, at least 15 recurrent mutations were required to explain the incompatibilities observed in this sample. However, it was identified that six of these recurrent mutations could be placed in the same sequence EO40, as illustrated in Figure E 7a 8 -3' Accession Ref 174 210 355 376 598 1205 1593 2692 3117 4093 5230 5857 … 12503 15003 15222 17999 19602 20233 20387 21614 21801 22022 22206 22813 22992 23012 23031 23063 23664 24062 24133 25145 25273 25427 … 25814 25904 26456 26645 26681 27131 27152 27504 28254 28895 29440 29541 6. The sequence EO40 appeared to carry some of the mutations carried by sequence EO32, and some of the mutations characteristic of lineage B.1.1.7, strongly suggesting that this sequence was a recombinant. Our findings prompted further investigation by the submitters of this sequence, which revealed the signal to be the result of significant contamination of the genetic sample causing multiple errors in the consensus sequence, rather than a result of intra-host recombination. The sequence has subsequently been removed from GISAID. … 22660 23063 23271 23525 23604 23709 23731 24088 24506 24781 24914 25437 25617 26060 26305 26730 27008 27513 27579 27865 27866 27972 28048 28095 28111 28169 28280 28281 28282 28684 28889 28890 28977 29227 29405 29466 The dataset is illustrated in SI Appendix, Figure S6 . The locations of recurrent mutations identified by KwARG are shows as red and yellow crosses, corresponding to recurrent mutations occurring on the terminal and internal branches of the ARG, respectively. In the absence of recombination, at least T obs = 16 recurrent mutations are required, which has probability p < 1 · 10 −6 , strongly suggesting the presence of recombination. The probability of observing three or fewer recurrent mutations is 0.99, suggesting that at least five recombinations have occurred in the history of the sample. An ARG with five recombination nodes, showing a possible history of the dataset, is shown in SI Appendix, Figure S7 . A group of four identical sequences (M16-M19, shown in purple in Figure S7 ) appear to carry a characteristic set of shared mutations that strongly differentiates them from the other sequences in the sample. Five of the identified recurrent mutations affect this group, occurring in a relatively short stretch of the genome, suggesting that these patterns are indicative of recombination with other sequences in the sample carrying these mutations. Five of the other identified recurrent mutations can be placed in one sequence (M11), which appears to carry a mixture of mutations from the group identified above and other sequences in the sample, which is consistent with recombination. This sequence does not match any others in the dataset, so it is possible that this is the result of sequencing errors or sample contamination. If this sequence is removed from the sample, at least T obs = 9 recurrent mutations are still required to explain the observed incompatibilities, which has probability p < 1 · 10 −6 , still strongly suggesting that recombination is present. This agrees with previous reports of within-host recombination for MERS-CoV ( The method presented in this article offers a clear and principled framework for recombination detection, which can be interpreted as a hypothesis testing approach. We make very conservative assumptions throughout, demonstrating on both real and simulated data that the method achieves a very low rate of false positive results, while offering powerful detection of recombination at even relatively low values of recombination rate. We use nonparametric techniques at each stage, to avoid making assumptions on the process generating the data, and thus circumvent issues with model misspecification. Our method allows us to gain clear insights into the evolutionary events that may have generated the given sequences, offering easily interpretable results. The method detects sequences carrying patterns consistent with recombination, demonstrating its effectiveness as a tool for flagging sequences with distinctive patterns of incompatibilities for further detailed investigation. Our results clearly indicate the presence of recombination in the history of the analysed SARS-CoV-2 sequencing data, suggesting a recombination rate greater than around 4 · 10 −4 per site per year. One of the main limitations of our method is that KwARG does not scale well to large datasets. However, while studies relying on clade assignment and statistics such as linkage disequilibrium have identified that recombination occurs at very low levels ( Recombination can occur when the same host is co-infected by two different strains, which has been noted to occur in COVID-19 patients (Samoilov et al., 2020), and could become more likely with the emergence of more transmissible variants. We note that the potential mosaic sequences we identified in the South Africa sample from November are represented only once in the data. This could be due to a lack of onward transmission, as recombinants are likely to reach a detectable level at a relatively late stage in the infection cycle. It could also indicate that the sequences arose due to either contamination of the sample during processing, or the misassembly of two distinct (non-recombinant) strains present in the same sample, as was identified to be the case for one sequence in the England sample from November. We also note that while we sought to mask any sites known to be highly homoplasic, we cannot rule out that some of the identified recurrent mutations did arise multiple times as a consequence of selection or as a result of repeated sequencing errors. However, we have demonstrated that the solutions presented by KwARG can be examined for the presence of highly mutable sites, and have identified using both samples from South Africa that this appears to be the case for site 28 254 (located proximal to the stop codon of ORF8). Our findings suggest that care should be taken when performing and interpreting the results of analysis based on the construction of phylogenetic trees for SARS-CoV-2 data. The presence of recombination, as well as other factors complicating the structure of the transmission network of the virus, strongly suggests that tree-based models are not appropriate for modelling SARS-CoV-2 genealogies, and inference of evolutionary rates based on such methods may suffer from errors due to model misspecification that are difficult to quantify. Due to the high level of homogeneity between sequences, the effects of recombination will be either undetectable or indistinguishable from recurrent mutation in the majority of cases. However, as genetic diversity builds up over longer timescales, the effects of recombination may become more pronounced. Particularly in light of the recent emergence of new variants, the rapid evolution of the virus through recombination between strains with different pathogenic properties is a crucial risk factor to consider. This highlights the need for continuous monitoring of the sequenced genomes for new variants, to enable the early detection of novel recombinant genotypes, and for further work on the quantification of recombination rates and identification of recombination hotspots along the genome. SARS-CoV-2 sequencing data is publicly available from GISAID at gisaid.org upon free registration. MERS-CoV data is publicly available from the NCBI Virus database at ncbi.nlm.nih. gov/labs/virus. Code used in processing the data and carrying out the analysis is available at github.com/a-ignatieva/sars-cov-2-recombination. Supporting Information S1. Data: SARS-CoV-2 S1.1. Alignment and masking. SARS-CoV-2 sequences were downloaded from GISAID (Elbe & Buckland-Merrett, 2017), filtering for those labelled as complete (>29 000bp), collected from human hosts, and excluding any with more than 5% ambiguous nucleotides and incomplete collection dates. Although SARS-CoV-2 is an RNA virus, we refer to nucleotides by their DNA type for consistency with the sequencing data (i.e. the base type T corresponds to U on the actual SARS-CoV-2 genome). The following sites were masked from the data: • the endpoint regions with a large number of missing nucleotides (1-55bp and 29 804-29 903bp); • 322 further sites identified as problematic by De Maio et al. (2020) (prone to sequencing errors, known to be excessively homoplasic, or otherwise of questionable quality); • any multi-allelic sites. S1.2. Quality criteria. Any sequences failing the following quality criteria were removed: • at most 500 missing nucleotides (excluding start and end of alignment); • at most 1 non-ACTG character; • at most 25 gaps; • no SNP clusters (more than 6 SNPs in a window of 100 nucleotides, excluding known clusters of mutations). Nextclade (Hadfield et al., 2018 , tool available at clades.nextstrain.org) was used to check sampled sequences against these criteria (and it was ensured that any sequences assigned a score of "bad" by the tool were removed). In addition, the 198 sites identified by van Dorp et al. (2020, Supplementary Table S5) as potentially highly homoplasic were masked. S1.3. South Africa (November). All sequences collected in South Africa in November 2020 were downloaded and aligned as described in Section S1.1. Removing 48 sequences flagged by the submitter as containing long stretches of ambiguous nucleotides, and applying the quality criteria in Section S1.2, left a total of 278 sequences. The aligned sequences were split into the datasets SA N (the 177 sequences labelled as belonging to variant 501Y.V2 in GISAID) and SA O (the other 101 sequences). A sample of 25 sequences from each of SA O and SA N was selected at random using SeqKit (Shen et al., 2016) . Masking was carried out as described in Section S1.1; in addition, sites 22 266-22 745 were masked, as many of the sequences contained a large number of ambiguous nucleotides at these positions. No further multi-allelic sites were identified. Of the total 1 125 masked positions, 28 corresponded to segregating sites in the dataset. The resulting sample comprises 50 sequences with 207 variable sites. The corresponding GISAID accession numbers and collection dates are given in Table S1 . S1.4. South Africa (February). All sequences collected in South Africa in February 2021 were downloaded, aligned, and masked as described in Section S1.1, also masking sites 22 266-22 745; no additional multi-allelic sites were identified. The quality filters detailed in Section S1.2 were applied. One sequence in the resulting sample was not from lineage B.1.351 and was removed. Of the total 1 125 masked positions, 17 corresponded to segregating sites. The resulting sample consists of 38 sequences, all from lineage B.1.351, with 151 variable sites. The corresponding GISAID accession numbers and collection dates are given in Table S2 . S1.5. England (November). All sequences labelled as clade GR, collected in England in November 2020, were downloaded and aligned as per Section S1.1. Exact duplicates of sequences in the dataset were removed, to avoid including identical sequences in the sample. The sequences were then split into datasets E N (934 sequences labelled as belonging to lineage B.1.1.7) and E O (the other 2 650 sequences). A sample of 40 sequences from each of E O and E N was then selected at random using SeqKit. Sites were masked as detailed in Section S1.1. Three multi-allelic sites were identified and masked, at positions 12 067, 21 724, and 22 992. Of the total 477 masked positions, 10 corresponded to segregating sites in the dataset. The quality control criteria in Section S1.2 were not applied to this sample. The resulting sample comprises 80 sequences with 363 variable sites. The corresponding GISAID accession numbers and collection dates are given in Table S3 . S1.6. England (January). All sequences labelled as clade GR, collected in England in December 2020 -January 2021, were downloaded and aligned as per Section S1.1. Sites were masked as detailed in Section S1.1, and the quality filters detailed in Section S1.2 were applied. A sample of 38 sequences was selected at random using SeqKit, from among sequences uploaded by the COG UK Consortium; additionally, the sequence EPI ISL 994038 (E39) identified as a potential recombinant by Jackson et al. (2021) , and its potential parent sequence EPI ISL 820233 (E40), were included. Five multi-allelic sites were identified and masked, at positions 21 255, 23 604, 24 914, 28 310, and 29 227. Of the total 660 masked positions, 35 corresponded to segregating sites in the dataset. The resulting sample comprises 40 sequences with 276 variable sites. The corresponding GISAID accession numbers and collection dates are given in Table S4 . Table S5 . KwARG seeks to minimise the number of posited recombination and recurrent mutation events in each solution, and the proportions of the two event types can be controlled by specifying input 'cost' parameters C SE , C RM , C R , and C RR , corresponding to penalties assigned to recurrent mutations on the terminal branches of the ARG, those on internal branches, recombination events, and two consecutive recombination events (which can mimic the effects of gene conversion), respectively. For instance, setting (C SE , C RM , C R , C RR ) = (0.5, 0.51, 1.0, 2.0) is likely to produce solutions with more recurrent mutations than recombinations, as the cost of recurrent mutations is lower, favouring placing recurrent mutations on the terminal branches of the ARG where possible. Recurrent mutations on the terminal branches of the ARG affect only one sequence in the input dataset, so can be examined separately for indications that they arose due to errors in the sequencing process. KwARG implements a method of randomly exploring the space of possible ARGs, so it should be run multiple times for each configuration of input parameters, and the best identified solutions (with the minimal number of posited recombinations and/or recurrent mutations) then selected for analysis. An input parameter T (the 'annealing temperature') controls the extent of this random exploration. S4.1. Distribution of the number of recurrent mutations. Let M be the length of the genome, and let m be the number of observed variable sites in the sample. We are interested in estimating the distribution of the number of recurrent mutations that have occurred; that is, the excess number of mutation events beyond the minimum m needed to explain the variability in the sample. Regardless of any modelling assumptions on the evolution of a given sample or the genealogical relationships between the sequences, it is clear that at least m mutation or sequencing error events must have occurred in the history of the sample (here, a 'sequencing error' refers to the variant at a site being incorrectly called during the sequencing process). Suppose that each time such an event occurs (disregarding which particular sequence is affected), a position on the genome is selected at random with replacement, according to a probability vector P of length M. This corresponds to assuming that (i) such events occur independently from each other, (ii) all sequences have the same probabilities P of a mutation or sequencing error event occurring at each particular site. Moreover we assume that (iii) if a site undergoes at least one mutation in the history of the sample, the site is segregating in the data; and (iv) any sequencing errors fall on each site with probability proportional to P . The validity of these assumptions is discussed below in Section S4.3. The number of recurrent mutations in a sample with m variable sites can then be simulated using Algorithm 1. This is a 'balls-into-bins' type simulation, in which balls are placed one-by-one into M bins, each time selecting a bin at random with probability proportional to P , until m bins contain at least one ball; the output is the total number of balls thrown minus m. Executing Algorithm 1 multiple times and calculating a histogram of the results gives an approximation to the distribution of the number of recurrent mutations given the number m of observed segregating sites. S4.2. Mutation rate heterogeneity along the genome. Parts of the genome with a relatively higher mutation rate are more likely to undergo recurrent mutation, so it is important to incorporate the effects of mutation rate heterogeneity. We use an empirical estimate of mutation density to approximate the variation in mutation rate along the genome. S4.2.1. Data. All 17 908 sequences in GISAID collected around the world between 1 and 3 February 2021 were downloaded, filtering for sequences labelled as complete (>29 000bp), high coverage, and excluding any with more than 5% ambiguous nucleotides. Alignment was performed as described in Section S1.1. SNP-sites (Page et al., 2016) was used to extract the positions of the 13 747 identified SNPs; a vector P of length 29 903 was then formed, with a 1 entry at position i if there had been at least one mutation at position i of the genome, and 0 otherwise. If the mutation rate is constant along the genome, we would expect the 1's to be spread uniformly throughout P ; uneven clustering of the mutations gives an indication of mutation rate heterogeneity. We note that an alternative approach would be to fit a tree to the sequencing data (using maximum likelihood, for instance), count the minimum number of mutations required at each site of the genome, and use this to estimate P . However, this was found to result in very noisy estimates, and provide worse quantification of mutation rate heterogeneity (which we confirmed through simulation studies). Smoothing. The mutation density along the genome was then estimated nonparametrically from P by smoothing using wavelet decomposition, as implemented in the R package wavethresh (Nason et al., 2010) . This method was chosen as it does not require selecting a particular model, and it captures both fine-scale and broad variation in mutation density, allowing for the calculation of a smoothed estimate of P incorporating both local and large-scale rate heterogeneity. Briefly, wavelet decomposition can be used to obtain an estimate of a signal from a set of discrete observations, by analysing variation in the data at increasingly coarser scales (Nason, 2008) . Given M = 2 n observations of sites, corresponding to the entries of P (padding the vector P to the nearest power of 2 by reflecting the data at the endpoints), n iterations are performed, and at the i-th iteration (1) coefficients are computed using (non-overlapping) subsets of 2 i neighbouring observations, and (2) these coefficients are used to refine a smoothed estimate of the data. The computation of coefficients and the smoothed approximations is governed by the choice of wavelet shape; we use Daubechies' least-asymmetric wavelets (Daubechies, 1988) with six vanishing moments (other choices of wavelet basis produced similar results). Wavelet shrinkage can be used to obtain a smoothed estimate of the observations and remove noise: coefficient selection is performed by only keeping coefficients with values above a certain threshold and setting the others to zero. There are myriad ways of calculating such a threshold (Nason, 2008) ; we apply the empirical Bayes method of Johnstone & Silverman (2005b) implemented in the R package EbayesThresh (Johnstone & Silverman, 2005a) . As the mutation rate is dependent on the base type of the nucleotide undergoing mutation (Simmonds, 2020; Koyama et al., 2020), P was split into four parts by the corresponding base type in the reference sequence, and the wavelet decomposition and thresholding performed separately for each part before joining them back together. The resulting smoothed estimate P is shown in Figure 1 . The total estimated mutation probability for each base type closely matches the actual proportion of mutations that fall on sites of each base type in the data, as desired. The smoothing method has clearly identified both localised and long-range variation in mutation density along the genome. To check consistency of the results across time periods, data from September-November 2020 was also used to produce smoothed estimates of P (consisting of 41 376 sequences with 14 263 variable sites). The resulting estimate was found to agree closely to that obtained using the February data, so the latter was used in further analysis. S4.3. Validity of assumptions. We now return to consider the validity of the assumptions stated in Section S4.1. Assumption (i) appears reasonable for the data at hand. Assumption (iii) can be violated if a mutation arising on a branch of the genealogy subsequently reverses through recurrent mutation: either on the same branch before it splits, or independently on every child branch subtending the mutation. We note that the probability of such events depends on the distribution of branch lengths in the genealogy; simulations using the standard coalescent model show that the probability of such events is small. Moreover, such events can never create incompatibilities in the data, so we can ignore their possibility for our purposes, as the solutions identified by KwARG will never include such recurrent mutation events. Regarding assumption (ii), as the mutation rates depend on the base type, we cannot claim that all sequences have exactly the same probabilities P of mutating at each particular site, as this will depend on the nucleotides carried by the sequence. However, we estimate the effect of this violation to be negligible, given the relatively low overall rate of mutation for SARS-CoV-2. To make our approximation even more conservative, we increase m by adding back the number of masked segregating sites (which are as stated in Sections S1.3 to S1.6), and further multiply the number of sites by a penalty factor of F = 1.1, which is justified in Section S4.3.1 below. Thus, we address assumption (iv) by noting that we have masked sites that are excessively prone to sequencing errors in the data, so correspondingly we decrease M by the number of masked sites and delete the corresponding entries from P . It is then reasonable to assume that sequencing errors occurring at the non-masked sites fall at each site with the same probabilities as mutations. The effects of this assumption being violated are explored further in Section S4.3.2. S4.3.1. Choice of penalty factor. As noted above, the number of segregating sites in the sample is multiplied by a penalty factor F before performing the simulations. This results in a larger number of recurrent mutations being simulated, skewing the distribution to the right and thus ensuring that the p-values calculated from the simulated distribution are reasonably conservative. This is necessary because, as with any regression method that aims to (partially) de-noise the data, there is a risk that the fitted curve underestimates the true mutation rate heterogeneity, which would result in the expected number of recurrent mutations being underestimated, leading to false positives. The choice of F = 1.1 was validated through simulation studies. First, we simulate a "true" mutation rate map P true across 29,903 sites, as a realisation of an autoregressive process. Then, we simulate 20 000 mutations falling on the genome (allowing sites to mutate multiple times), and recreate the vector P by marking which sites had (or had not) undergone at least one mutation. The method described in Section S4.2 is then applied to fit an estimated mutation density P fit . Finally, 10 000 simulations of Algorithm 1 are used to get an estimate of the null distribution: first, using P true with m ∈ {100, 300, 500} sample segregating sites, then using P sim with m · F sample segregating sites, for F ∈ {1.0, 1.1, 1.2, 1.3, 1.4, 1.5}. This procedure was repeated 500 times for each combination of m and F . The results are presented in Figure S1 . This demonstrates that without the penalty term, the fitted mutation density may indeed fail to capture all of the mutation rate heterogeneity that is present; for instance, when considering a sample with 300 segregating sites, in 46% of cases the 95th percentile of the simulated distribution will be lower than that of the true distribution. The results demonstrate that a value of F = 1.1 appears sufficient to negate this effect, without excessively increasing the false negative rate. Figure S1 . Comparison of simulated null distributions using P true and P fit . Points show the difference between the true and simulated median (left panel), 90th percentile (middle), 95th percentile (right), with size proportional to the number of observations, split by penalty factor F (x-axis) and the number of sample segregating sites m (colours). Ideally, the points should be concentrated around 0; values above (below) 0 may result in false positives (false negatives) when using the estimated null distribution. Percentages show the proportion of cases lying above 0. S4.3.2. Presence of highly homoplasic sites. Violations of assumption (iv) can occur if some (nonmasked) sites along the genome are highly homoplasic, which can occur due to the effects of selection, or as an artifact of the sequencing process. Our method will not pick up the spikes in the corresponding positions of P , potentially introducing bias to the simulated null distribution that could lead to false positives. We investigate the extent to which a violation of assumption (iv) affects the resulting inference through simulation studies. For each i ∈ {0, 1, 5, 10, 20, 50, 100, 200}, i sites of the genome are chosen, and the corresponding probabilities in P are multiplied by a factor H ∈ {2, 5, 10, 20, 50} to give the vectors P i,H . This recreates the effect of having i sites which are highly homoplasic (with the extent of this controlled by H); an example of P 50,2 is shown in Figure S2 . For each combination of i and H, 200 datasets of 80 sequences were simulated using msprime (Kelleher et al., 2016) , with parameters that appear reasonable for SARS-CoV-2: Figure S2 . Mutation density estimate P adjusted by selecting i = 50 sites and multiplying the corresponding entry of P by H = 2 (resulting values shown in orange), to recreate the presence of 50 highly homoplasic sites. • N e = 1 · 10 6 , exponential growth rate of 1.5 (no appropriate published estimates of these parameters could be identified, but this choice was found to give reasonable values of MRCA time and number of segregating sites for the simulated datasets); • binary mutation model (finite sites); • mutation rate per site per generation given by the entries of P i,H × 2 · 10 −5 × 29 903. This was calculated based on: 1.1, ∞, ∞) , (0.01, 0.02, 1.00, 2.00)}) to calculate the minimal number of recurrent mutations needed to explain the dataset in the absence of recombination. A p-value was then calculated, using the null distribution simulated using the un-adjusted vector P and 10 000 iterations of Algorithm 1, with m set to the number of segregating sites in the dataset multiplied by the penalty factor F = 1.1. The proportion of times the null hypothesis is (incorrectly) rejected, with p < 0.05, is shown in the left panel of Figure 2 . S4.3.3. Detection rate vs recombination rate. We investigate how the proportion of cases in which the null hypothesis is rejected (with p < 0.05) varies with recombination rate. For several values of the recombination rate 1 · 10 −7 ≤ ρ ≤ 1 · 10 −5 (per site per generation), 200 datasets were simulated using msprime with the parameters given in Section S4.3.2 (using the un-adjusted vector P ), and the same method used to calculate a p-value for each dataset. The results are presented in the right panel of Figure 2 . The same methodology as described in Section S4 was used to simulate the null distribution for MERS-CoV. Sequences were downloaded from the NCBI Virus database, filtering for those of length at least 20 000 bp, from human and camel hosts, across all time periods. Alignment to the reference sequence was performed as described in Section S2. The alignment comprised 700 sequences with 14 238 variable sites. The vector P was constructed, and wavelet decomposition was used to fit the estimate P in the same manner as described in Section S4.2; the result is shown in Figure S3 . With the resulting estimate P and the parameters given in Section S6, 1 000 000 iterations of Algorithm 1 were used to simulate the null distribution. The resulting probabilities and p-values are shown in the third and fourth columns of Table 1 604 913 1163 1392 1469 2110 3256 3267 4002 5220 5388 5622 5986 6286 6954 8683 9641 9693 10523 11025 12067 12162 13261 13536 13849 14120 14202 14676 15279 15406 16176 17215 17353 17615 19020 19542 19718 21306 21637 21778 21974 21993 22227 22388 22917 22992 23063 23271 23709 24106 24506 24904 24912 25912 26060 26681 26741 26801 27970 27972 28048 28068 28095 28111 28169 28280 28281 28282 28291 28977 29311 29402 29466 29645 29730 29771 10 ORF1b Sequence Figure S4 . Issues with SARS-CoV-2 sequencing data MERS-CoV recombination: implications about the reservoir and potential for adaptation Data, disease and diplomacy: GISAID's innovative contribution to global health The coronavirus proofreading exoribonuclease mediates extensive viral recombination An ancestral recombination graph in Progress in population genetics and human evolution Virus Variation Resource-improved response to emergent viral outbreaks Gene genealogies, variation and evolution: A primer in coalescent theory Statistical properties of the number of recombination events in the history of a sample of DNA sequences Parsimonious reconstruction of ancestral recombination graphs with recurrent mutation Recombinant SARS-CoV-2 genomes involving lineage B.1.1.7 in the UK Variant analysis of SARS-CoV-2 genomes Detecting recombination from gene trees A coalescent-based method for detecting and estimating recombination from gene sequences Phylogenetic and phylodynamic analyses of SARS-CoV-2 Evaluation of methods for detecting recombination from DNA sequences: Computer simulations The effect of recombination on the accuracy of phylogeny estimation Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations No detectable signal for ongoing genetic recombination in SARS-CoV-2. bioRxiv Orthonormal bases of compactly supported wavelets Issues with SARS-CoV-2 sequencing data Data, disease and diplomacy: GISAID's innovative contribution to global health Nextstrain: Real-time tracking of pathogen evolution Virus Variation Resource-improved response to emergent viral outbreaks Recombinant SARS-CoV-2 genomes involving lineage B.1.1.7 in the UK EbayesThresh: R and S-Plus programs for empirical Bayes thresholding Empirical Bayes selection of wavelet thresholds MAFFT multiple sequence alignment software version 7: Improvements in performance and usability Efficient coalescent simulation and genealogical analysis for large sample sizes Variant analysis of SARS-CoV-2 genomes Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia Wavelet methods in statistics with R Wavelets statistics and transforms SNP-sites: Rapid efficient extraction of SNPs from multi-FASTA alignments SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation Rampant C→U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: Causes and consequences for their short-and long-term evolutionary trajectories. mSphere 5 Rows correspond to sequences, labelled on the left. Columns correspond to positions along the genome; uninformative sites (with all 0's or 1's) and those with singleton mutations (with exactly one 0 or 1) are not shown. Light blue and dark blue denote differing allele types. Red crosses highlight sites of recurrent mutations identified by KwARG located on the terminal branches of the ARG (affecting only one sequence). Yellow crosses highlight recurrent mutations on internal branches We thank the maintainers and contributors of the GISAID database; a full table of acknowledgements for the data used is provided at github.com/a-ignatieva/sars-cov-2-recombination/ GISAID_acknowledgements. We thank Oliver Pybus for help with identifying quality issues affecting sequence EO40. This work was supported by the EPSRC and MRC OxWaSP Centre for Doctoral Training (EPSRC grant EP/L016710/1), and by the Alan Turing Institute (EPSRC grant EP/N510129/1). 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 M17 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 M18 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 M19 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1