key: cord-0865898-agsnzpjw authors: Müller, Nicola F.; Kistler, Kathryn E.; Bedford, Trevor title: Recombination patterns in coronaviruses date: 2022-02-08 journal: bioRxiv DOI: 10.1101/2021.04.28.441806 sha: a739453be23009f08941aa2e43b16d3ffa00576f doc_id: 865898 cord_uid: agsnzpjw As shown during the SARS-CoV-2 pandemic, phylogenetic and phylodynamic methods are essential tools to study the spread and evolution of pathogens. One of the central assumptions of these methods is that the shared history of pathogens isolated from different hosts can be described by a branching phylogenetic tree. Recombination breaks this assumption. This makes it problematic to apply phylogenetic methods to study recombining pathogens, including, for example, coronaviruses. Here, we introduce a Markov chain Monte Carlo approach that allows inference of recombination networks from genetic sequence data under a template switching model of recombination. Using this method, we first show that recombination is extremely common in the evolutionary history of SARS-like coronaviruses. We then show how recombination rates across the genome of the human seasonal coronaviruses 229E, OC43 and NL63 vary with rates of adaptation. This suggests that recombination could be beneficial to fitness of human seasonal coronaviruses. Additionally, this work sets the stage for Bayesian phylogenetic tracking of the spread and evolution of SARS-CoV-2 in the future, even as recombinant viruses become prevalent. SARS-CoV-1. The common ancestor timings of P2S across the genome are equal between RaTG13 and SARS-139 CoV-2 (Fig. S3C ). RaTG13 on the other hand is more closely related to SARS-CoV-2 than P2S (Fig. S3B ) 140 across the entire genome. 141 When looking at when different viruses last shared a common ancestor anywhere on the genome (in other 142 words: when the ancestral lineages of two viruses last crossed paths), we find that RmYN02 has the most recent 143 MRCA with SARS-CoV-2 (Fig. S3C) . The median estimate of the most recent MRCA between SARS-CoV-2 144 and RmYN02 is 1986 (95% CI: 1973 CI: -2005 . For RaTG13 it is 1975 (95% CI: 1988 -1964 , for P2S it is 1949 145 (95% CI: 1907 CI: -1973 and with SARS-CoV-1 it is 1834 (95% CI: . These estimates are contingent on 146 a fixed evolutionary rate of 5 × 10 −4 per nucleotide per year. Rates of recombination vary with rates of adaptation in human seasonal coron-148 aviruses 149 We next investigate recombination patterns in MERS-CoV, which has over 2500 confirmed cases in humans, 150 as well as in human seasonal coronaviruses 229E, OC43 and NL63, which have widespread seasonal circulation 151 in humans. As for the SARS-like viruses, we jointly infer recombination networks, rates of recombination and 152 population sizes for these viruses. We assumed that the genomes evolved under a GTR+Γ 4 model and, in 153 contrast to the analysis of SARS-like viruses, inferred the evolutionary rates. We observe frequent recombination 154 in the history of all 4 viruses, wherein genetic ancestry is described by network rather than a strictly branching phylogeny ( Fig. 2A-D) . 156 The human seasonal coronaviruses all have recombination rates around 1 × 10 −5 per site and year (Fig. S7) . 157 This is around 10 to 20 times lower than the evolutionary rate (Fig. S8 ). In contrast to the recombination rates, 158 the evolutionary rates vary greatly across the human seasonal coronaviruses, with rates between a median of 159 1.3 × 10 −4 (CI 1.1 − 1.5 × 10 −4 ) for NL63 and median rate of 2.5 × 10 −4 (CI 2.2 − 2.7 × 10 −4 ) and 2.1 × 10 −4 160 (CI 1.9 − 2.3 × 10 −4 ) for 229E and OC43 (Fig. S8 ). These evolutionary rates are substantially lower than those 161 estimated for SARS-CoV-2 (1.1×10 −3 substitutions per site and year (Duchene et al., 2020) ), which are more in 162 line with our estimates for the evolutionary rates of MERS with a median rate of 6.9×10 −4 (CI 6.0−7.9×10 −4 ). Evolutionary rate estimates can be time dependent, with datasets spanning more time estimating lower rates of 164 evolution that those spanning less time (Duchêne et al., 2014) . In turn, this means that the evolutionary rates 165 estimates for SARS-CoV-2 will likely be lower the more time passes. It is unclear though, it will approximate 166 the evolutionary rates of other seasonal coronaviruses. On a per-lineage basis the estimated recombination rate for seasonal coronaviruses translates into around The evolutionary purpose of recombination in RNA viruses, as well as whether recombination provides 184 a fitness benefit is unclear (Simon-Loriere and Holmes, 2011). To investigate whether recombination benefits 185 fitness in human coronaviruses, we next tested whether rates of recombination differed on different parts of 186 the genome. To do so, we allowed for different relative rates of recombination within the region 5' of spike 187 (i.e. mostly ORF1ab), spike itself, and everything 3' of spike. We computed recombination rate ratios on each 188 of these 3 sections of the genome as the recombination rate on that section divided by the mean rate on the 189 other two sections. We infer that recombination rates are elevated in the spike protein of all human seasonal 190 coronaviruses considered here (Fig. 3 , S10 & S11). This is consistent with other work estimating higher rates of 191 recombination on the spike protein of betacoronaviruses (Bobay et al., 2020) . 192 We next tested whether recombination rates are elevated on parts of the genome that also show strong signs 193 of adaptation. To do so, we computed the rates of adaption on different parts of the genomes of the seasonal 194 human coronaviruses using the approach described in (Bhatt et al., 2011) and Kistler and Bedford (2021) . This approach does not explicitly consider trees to compute the rates of adaptation on different parts of the 196 genomes and is not affected by recombination (Kistler and Bedford, 2021) . We find that sections of the genome 197 with relatively higher rates of adaptation correspond to sections of the genome with relatively higher rates of 198 recombination (Fig. 3 ). In particular, recombination and adaptation are elevated on the section of the genome 199 that codes for the spike protein and are lower elsewhere. 200 We next investigated whether these trends hold when looking only at spike. The spike protein is made up of 201 two subunits. Subunit 1 (S1) binds to the host cell receptor, while subunit 2 (S2) facilitates fusion of the viral and 202 cellular membrane (Walls et al., 2020) . S1 contains the receptor binding domain and rates of adaptation have been shown to be high in S1 for 229E and OC43 (Kistler and Bedford, 2021). While the rates of adaptation are 204 relatively low overall for NL63, there is still some evidence that they are elevated in S1 compared to S2 (Kistler 205 and Bedford, 2021). To test whether recombination rates vary with rates of adaptation on the subunits of spike as well, we 207 inferred the recombination rates from spike only, allowing for different rates of recombination on S1 versus the 208 rest of spike. We find that the rates of recombination are elevated on S1 for 229E and OC43 compared to the 209 rest of the spike gene (Fig. 3 ). This is consistent with strong absolute rates of adaptation on S1 on these two 210 viruses. For NL63, we find weak evidence for the rate on S2 to be slightly higher than on S1 (Fig. 3) , even 211 though the rates of adaptation are inferred to be higher on S1. The absolute rate of adaptation in S1 of NL63 212 is, however, substantially lower than for 229E or OC43. Additionally, the uncertainty around the estimates 213 on adaption rate ratios between the two subunits for NL63 are rather large and include no difference at all. Overall, these results suggest that recombination events are either positively or negatively selected for. Elevated 215 rates of recombination in areas where adaptation is stronger have been described for other organisms (reviewed 216 here (Nachman, 2002)). Alternatively, higher rates of recombination could also be due to mechanistic reasons, 217 as has been suggested in the case of SARS-CoV-2 (Turakhia et al., 2021). To further investigate this, we next computed the rates of recombination on fitter and less fit parts of the 219 recombination networks of 229E, OC43 and NL63. To do so, we first classify each edge of the inferred posterior 220 distribution of the recombination networks into fit and unfit based on how long a lineage survives into the future. Fit edges are those that have descendants at least one, two, five or ten years into the future and unfit edges Figure 3 : Comparison of recombination rates with rates of adaptation on different parts of the genomes of seasonal human coronaviruses 229E, OC43 and NL63. Relationship between estimated relative recombination rate (x-axis) and relative adaptation rate (y-axis) for three different seasonal human coronaviruses: 229E, OC43 and NL63. These estimates are shown for different parts of the genome, indicated by the different colors. These result from two different types of analysis: one using spike only (subunit 1 over subunit 2, shown in yellow) and one using the full genome (shown in orange, blue and green). The rate ratios denote the rate on a part of the genome divided by the average rate on the two other parts of the genome. those that do not. We then computed the rates of recombination on both types of edges for the entire posterior 223 distribution of networks. Overall, we do not find that fit edges show relatively higher rates of recombination (see 224 figure S9 ). The simplest explanation is that we do not have enough data points to measure recombination rates 225 on unfit edges, meaning to measure recombination rates on part of the recombination network where selection 226 had too little time to shape which lineages survive and which go extinct. An alternative explanation to why 227 we see elevated rate or recombination in the spike protein, but do not observe a population level fitness benefit 228 could be that most (outside of spike) recombinants could be detrimental to fitness with few (within spike) having 229 little fitness effect at all. Though not yet highly prevalent, evidence for recombination in SARS-CoV-2 has started to appear (VanIns- While their impact on the evolutionary dynamics of SARS-CoV-2 remains unclear, the likely rise of future 241 SARS-CoV-2 recombinants will further necessitate methods that allow phylogenetic and phylodynamic infer-242 ences to be performed in the presence of recombination (Neches et al., 2020) . In absence of that, recombination 243 has to be either ignored, leading to biased phylogenetic and phylodynamic reconstruction (Posada and Cran-244 dall, 2002), or non-recombinant parts of the genome have to be used for analyses, reducing the precision of 245 these methods. Our approach addresses this gap by providing a Bayesian framework to infer recombination 246 networks. To facilitate easy adaptation, we implemented the method so that analyses can be set up follow-247 ing the same workflow as regular BEAST2 (Bouckaert et al., 2018) analyses. Extending the current suite of 248 population dynamic models, such as birth-death models (Stadler, 2009) In order to perform joint Bayesian inference of recombination networks together with the parameters of the associated models, we use a MCMC algorithm to characterize the joint posterior density. The posterior density is denoted as: where N denotes the recombination network, µ the evolutionary model, θ the effective population size and ρ 270 the recombination rate. The multiple sequence alignment, that is the data, is denoted D. P (D|N, µ) denotes 271 the network likelihood, P (N |θ, ρ), the network prior and P (µ, θ, ρ) the parameter priors. As is usually done in 272 Bayesian phylogenetics, we assume that P (µ, θ, ρ) = P (µ)P (θ)P (ρ). Network Likelihood While the evolutionary history of the entire genome is a network, the evolutionary history of each individual position in the genome can be described as a tree. We can therefore denote the likelihood of observing a sequence alignment (the data denoted D) given a network N and evolutionary model µ as: with D i denoting the nucleotides at position i in the sequence alignment and T i denoting the tree at position i. The likelihood at each individual position in the alignment can then be computed using the standard pruning 276 Figure 4 : Example recombination network. Events that can occur on a recombination network as considered here. We consider events to occur from present backwards in time to the past (as is the norm when looking at coalescent processes). Lineages can be added upon sampling events, which occur at predefined points in time and are conditioned on. Recombination events split the path of a lineage in two, with everything on one side of a recombination breakpoint going in one direction and everything on the other side of a breakpoint going in the other direction. algorithm (Felsenstein, 1981 The network prior is denoted by P (N |θ, ρ), which is the probability of observing a network and the embedding 284 of segment trees under the coalescent with recombination model, with effective population size θ and per-lineage 285 recombination rate ρ. It essentially plays the same role that tree prior plays in standard phylodynamic analyses. 286 We can calculate P (N |θ, ρ) by expressing it as the product of exponential waiting times between events (i.e., recombination, coalescent, and sampling events): where we define t i to be the time of the i-th event and L i to be the set of lineages extant immediately prior to 287 this event. (That is, Given the coalescent process is a constant size coalescent and given the i-th event is a coalescent event, the event contribution is denoted as: If the i-th event is a recombination event and assuming constant rates of recombination over time, the event contribution is denoted as: The interval contribution denotes the probability of not observing any event in a given interval. It can be computed as the product of not observing any coalescent, nor any recombination events in interval i. We can therefore write: where λ c denotes the rate of coalescence and can be expressed as: and λ r denotes the rate of observing a recombination event on any co-existing lineage and can be expressed as: In order to allow for recombination rates to vary across s sections S s of the genome, we modify λ r to differ in each section S s , such that: with L(l) ∩ S s denoting denoting the amount of overlap between L(l) and S s . The recombination rate in each 289 section s is denoted as ρ s . In order to explore the posterior space of recombination networks, we implemented a series of MCMC operators. Gibbs operator. The Gibbs operator efficiently samples any part of the network that is older than the 308 root of any segment of the alignment and is thus not informed by any genetic data. Empty loci preoperator. The empty segment preoperator augments the network with edges that do not 310 carry any loci for the duration of a move, to allow larger jumps in network space. One of the issues when inferring these recombination networks is that the root height can be substantially alignment. We then tested whether we are able to infer recombination networks, recombination rates, effective population 321 sizes and evolutionary parameters from simulated data. To do so, we randomly simulated recombination networks 322 under the coalescent with recombination. On top of these, we then simulated multiple sequence alignments. 323 We then re-infer the parameters used to simulate using our MCMC approach. As shown in Figure S13 , these 324 parameters are retrieved well from simulated data with little bias and accurate coverage of simulated parameters 325 by credible intervals. 326 We next tested how well we can retrieve individual recombination events. To do so, we plot the location 327 and timings simulated recombination events for the first 9 out of 100 simulations. We then plot the density Jukes Cantor model with 100 leafs and a recombination rate drawn randomly from a log-normal distribution. We 344 then infer the recombination rates using 5 different recombination rate priors as shown in figure S15F that put 345 some or a lot of weight on the wrong parameters. As shown in figures S15A-E, we are able to infer recombination 346 rates, even with the wrong priors. works for the MERS spike protein to treating the evolutionary histories as trees. We find that although the 349 effective sample size values are lower when inferring recombination networks, they are not orders of magnitude 350 lower (see fig S17) . Recombination network summary 352 We implemented an algorithm to summarize distributions of recombination networks similar to the maximum 353 clade credibility framework typically used to summarize trees in BEAST (Heled and Bouckaert, 2013) . In short, Figure S1 : Recombination rate ratios of SARS-like viruses on different parts of the genome. Recombination rate ratios for SARS-like viruses based on two different analyses: one using the full genome (left) and one using the spike protein only (right). The rate ratios denote the rate on a part of the genome divided by the average rate on the two other parts of the genome. s1 over s2 denotes the rate ratio on spike subunit 1 over subunit 2. Fig. 1A . The different colors represent whether a local trees was towards the 5' or 3' end relative to the region that codes for the spike protein, or whether it was on spike itself. Figure S9 : Recombination rates of different parts of the recombination networks. Recombination rates are computed for different parts of the network based on how long lineages persist for into the future. For this analysis, we classified each edge of the recombination network in the posterior distribution into fit and unfit. Fit are edges that persist for at least 1, 2, 5 or 10 years into the future (plots from left to right). We compute the rates of recombination on these edges as well as on those who go extinct more rapidly. We repeat the same for posterior predictive recombination networks that we simulated from the given sampling times, the inferred effective population sizes and the inferred recombination rates under the coalescent with recombination. Figure S12 : Comparison of network statistics when simulating under the coalescent with recombination compared to sampling under the truncated coalescent with recombination. We compare the posterior distributions of network height, length and the number of recombination nodes when simulating recombination networks under th coalescent with recombination and when MCMC sampling under the implementation of coalescent with recombination. We compare this for all the different MCMC operators implemented. For MCMC operators which are not universal (cannot reach every point in the posterior distribution by themselves), we tested the operator jointly with the Add/remove operator. The statistics "above the root" take into account the full distribution of networks. The statistics "below the root" only take into account the parts of the network that are below (more recent) than the oldest root of any individual position in the alignment. These are the parts of the network that directly impact the likelihood. Figure S13 : Inferred vs. true rates based on simulated data. We simulated recombination networks and sequence alignment using the randomly drawn values on the x-axis and then re-inferred these parameters on the y-axis. The size of the cross is scaled by the product of the recombination length and the amount of genetic information that recombined. The contour plot shows the distribution of inferred recombination events by location on the genome and time computed from the inferred posterior distribution of networks. Figure S14 : Inferred vs. simulated recombination events. Here, we show the recombination events for the simulated networks (red cross) with the position on the genome on the x-axis and the timing of the event on the y-axis. The contour plots show the density for inferred recombination events for the first 9 iterations of the simulation study. The time in the past is limited to the duration of sampling, i.e. the time when samples were taken. Figure S15 : Impact of the recombination rate prior distribution on the inferred recombination rates. Here, we compare then inferred recombination rates when using different prior distributions that differed from the distributions from which the rates for simulations were sampled. The rates for simulations were sampled from a log-normal distribution with µ = −11.12 and σ = 0.5. In A, we shows the inferred rates when using a prior distribution with µ = −12.74 and σ = 0.5 (leading to a 5 times lower mean in real space than the correct prior). In B, we shows the inferred rates when using a prior distribution with µ = −12.74 and σ = 2. In C, we shows the inferred rates when using the same prior distribution as was sampled under. In D, we shows the inferred rates when using a prior distribution with µ = −9.72 and σ = 2. In E, we shows the inferred rates when using a prior distribution with µ = −9.72 and σ = 0.5 (leading to a 5 times higher mean in real space than the correct prior). Figure F shows the corresponding density plots for all log normal distributions used as prior distributions on the recombination rates. We additionally provide a tutorial on how to setup and postprocess an analysis at https: References Rasmussen