key: cord-260508-z11exbyu authors: Wang, Hongru; Pipes, Lenore; Nielsen, Rasmus title: Synonymous mutations and the molecular evolution of SARS-Cov-2 origins date: 2020-10-12 journal: bioRxiv DOI: 10.1101/2020.04.20.052019 sha: doc_id: 260508 cord_uid: z11exbyu Human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is most closely related, by average genetic distance, to two coronaviruses isolated from bats, RaTG13 and RmYN02. However, there is a segment of high amino acid similarity between human SARS-CoV-2 and a pangolin isolated strain, GD410721, in the receptor binding domain (RBD) of the spike protein, a pattern that can be caused by either recombination or by convergent amino acid evolution driven by natural selection. We perform a detailed analysis of the synonymous divergence, which is less likely to be affected by selection than amino acid divergence, between human SARS-CoV-2 and related strains. We show that the synonymous divergence between the bat derived viruses and SARS-CoV-2 is larger than between GD410721 and SARS-CoV-2 in the RBD, providing strong additional support for the recombination hypothesis. However, the synonymous divergence between pangolin strain and SARS-CoV-2 is also relatively high, which is not consistent with a recent recombination between them, instead it suggests a recombination into RaTG13. We also find a 14-fold increase in the dN/dS ratio from the lineage leading to SARS-CoV-2 to the strains of the current pandemic, suggesting that the vast majority of non-synonymous mutations currently segregating within the human strains have a negative impact on viral fitness. Finally, we estimate that the time to the most recent common ancestor of SARS-CoV-2 and RaTG13 or RmYN02 based on synonymous divergence, is 51.71 years (95% C.I., 28.11-75.31) and 37.02 years (95% C.I., 18.19-55.85), respectively. The Covid19 pandemic is perhaps the biggest public health and economic threat that the world 23 has faced for decades (Li, Guan, et al. 2020; ; Zhou, Yang, et al. 2020) . It is 24 caused by a coronavirus (Lu, et al. 2020 ; Zhang and Holmes 2020), Severe acute respiratory The genome of human coronavirus can effectively recombine with other viruses to form a 1 chimeric new strain when they co-infect the same host (Forni, et al. 2017; Boni, et al. 2020 ). 2 Complicated recombination histories have been observed in the receptor binding motif region of 3 the spike protein Xiao, et al. 2020; Zhang, et al. 2020 ) and several other 4 regions (Boni, et al. 2020 ) of the SARS-CoV-2, it is thus important to exhaustively search along 5 the viral genome for other regions potentially of recombination origin and identify possible 6 donors associated with them. To identify possible viral strains that may have contributed, by 7 recombination, to the formation of human SARS-CoV-2, we searched NCBI and EMBL virus 8 entries along with GISAID Epiflu and EpiCov databases for similar sequences using BLAST in 9 100bp windows stepping every 10bp (Fig. 1b) . The majority of the genome (78.1%, 2330/2982 10 of the windows) has one unique best hit, likely reflecting the high genetic diversity of the 11 coronavirus. 21.9% of the genomic regions has multiple best hits, which suggests that these 12 regions might be more conserved. Among the windows with unique best hits, 97.0% (2260/2330) 13 of them were the RaTG13 or RmYN02 bat strains and 1.9% of them, including the ACE2 14 contact residues region of the S protein, were the pangolin SARS-CoV-2 virus. These 15 observations are consistent with previous results that RaTG13 and RmYN02 are the most 16 closely related viral strains, while the region containing the ACE2 contact residues is more 17 closely related to the pangolin virus strain Table 4 ). The mosaic pattern that different regions of 23 the genome show highest identity to different virus strains is likely to have been caused by the 24 rich recombination history of the SARS-CoV-2 lineage (Boni, et al. 2020 ; Li, Giorgi, et al. 2020; in some genomic regions may suggest recombination between the ancestral lineage of SARS-1 CoV-2 and distantly related virus lineages, although more formal analyses are needed to 2 determine the recombination history (see also Boni, et al. 2020 for further discussion). 3 Searching databases with BLAST using the most closely related viral strains, RaTG13 and 4 RmYN02, we observe a very similar pattern, as that observed for SARS-CoV-2, in terms of top 5 hits across the genome (Fig. 1b) , suggesting that these possible recombination events with 6 distantly related lineages are not unique to the SARS-CoV-2 lineage, but happened on the 7 ancestral lineage of SARS-CoV-2, RaTG13, and RmYN02. A notable exception is a large region 8 around the S gene, where RmYN02 show little similarity to both SARS-CoV-2 and RaTG13. 9 10 Sequence similarity and recombination 11 We focus further on studying the synonymous evolution of SARS-CoV-2, and analyzing Wuhan-12 Hu-1 as the human nCoV19 reference strain We performed recombination analyses across the five viral genomes based on the 23 concensus of the seven recombination-detection methods implemented in RDP5 (see Methods). 24 We identified nine recombination regions affecting at least one of the sequences incongruence when compared with genome-wide trees ( Fig. 2 and Supplementary Figure 1-3) . 1 Particularly, a recombination signal is found in a region encompassing the RBD of the S protein, 2 suggesting that the human SARS-CoV-2 (Wuhan-Hu-1) sequence is a recombinant with the 3 Pangolin-CoV (GD410721) as the donor (Supplementary Table 5 ). Phylogenetic analyses also 4 support that Wuhan-Hu-1 and GD410721 form a clade relative to RaTG13 (Supplementary 5 Figure 1c , 1d). Phylogenetic analyses (Fig. 2 ) in genomic regions with all recombination tracts 6 (Supplementary Table 5 ) masked using Maximum-likelihood (Fig. 2a) and Neighbor-joining 7 based on synonymous (Fig. 2b ) or non-synoymous (Fig. 2c ) mutation distance metrics, 8 consistently support RmYN02 as the nearest outgroup to human SARS-CoV-2, in contrast to 9 previous analyses before the discovery of RmYN02, which instead found RaTG13 to be the 10 nearest outgroup ). This observation is also consistent with the 11 genome-wide phylogeny constructed in previous study (Zhou, Chen, et al. 2020). 12 We plot the overall sequence similarity (% nucleotides identical) between SARS-CoV- 2 13 and the four other strains analyzed in windows of 300 bp (Fig. 1) . Notice that the divergences 14 between human SARS-CoV-2 and the bat viral sequences, RaTG13 and RmYN02, in most 15 regions of the genome, are quite low compared to the other comparisons. A notable exception is 16 the suspected recombination region in RmYN02 that has an unusual high level of divergence 17 recombination event with more distantly related viral strains (Fig. 1e) . The other four sequences 1 are all highly, and approximately equally, divergent from RmYN02 in this large region (Fig. 1e) , 2 suggesting that the RmYN02 strain obtained a divergent haplotype from the recombination 3 event. When BLAST searching using 100-bp windows along the RmYN02 genome, we find no 4 single viral genome as the top hit, instead the top hits are found sporadically in different viral 5 strains of the SARS-CoV lineage (Fig. 1f) , suggesting that the sequence of the most proximal 6 donor is not represented in the database. 7 8 While the overall divergence in the S gene encoding the spike protein could suggest the 10 presence of recombination in the region, previous study ) reported that the tree 11 based on synonymous substitutions supported RaTG13 as the sister taxon to the human SARS-12 CoV-2 also in this region. That would suggest the similarity between GD410721 and human 13 SARS-CoV-2 might be a consequence of convergent evolution, possibly because both strains 14 adapted to the use of the same receptor. An objective of the current study is to examine if there 15 are more narrow regions of the spike protein that might show evidence of recombination. We 16 investigate this issue using estimates of synonymous divergence per synonymous site (dS) in 17 sliding windows of 300 bp. However, estimation of d S is complicated by the high levels of 18 divergence and extremely skewed nucleotide content in the 3rd position of the sequences 19 (Table 1) which will cause a high degree of homoplasy. We, therefore, entertain methods for 20 estimation that explicitly account for unequal nucleotide content and multiple hits in the same 21 site such as maximum likelihood methods and the YN00 method (Yang and Nielsen 2000) . It is 22 shown that for short sequences, some counting methods, such as the YN00 method, can 23 perform better in terms of Mean Squared Error (MSE) for estimating d N and d S (Yang and 24 Nielsen 2000). However, it is unclear in the current case how best to estimate d S . For this 25 reason, we performed a small simulations study (see Methods) for evaluating the performance of the maximum likelihood (ML) estimator of d N and d S (as implemented in codeml (Yang 2007) ) 1 under the F3x4 model and the YN00 method implemented in PAML. In general, we find that 2 estimates under the YN00 are more biased with slightly higher MSE than the ML estimate for 3 values in the most relevant regime of d S < 1.5 (Fig. 3) . However, we also notice that both 4 estimators are biased under these conditions. For this reason, we perform a bias correction 5 calibrated using simulations specific to the nucleotide frequencies and d N /d S ratio observed for 6 SARS-CoV-2 (see Methods). The bias corrections we obtain are ^* = ^ + 0.455^2 -0.824^3 7 + 0.264^4, for the ML estimator and ^* = ^+ 1.492^2-3.166^3 + 1.241^4 for yn00. Notice 8 that there is a trade-off between mean and variance ( Fig. 3 ) so that the MSE becomes very 9 large, particularly for the for yn00 method, after bias correction. For d S >2 the estimates are 10 generally not reliable, however, we note that for d S <1.5 the bias-corrected ML estimator tends 11 overall to have slightly lower MSE, and we, therefore, use this estimator for analyses of 300 bp 12 regions. by recombination with more divergent strains. We, therefore, also estimate d N and d S for the 20 regions with inferred recombination tracts (Supplementary Table 5 ) removed from all 21 sequences (Table 3) respectively. This confirms that RmYN02 is the virus most closely related to SARS-CoV-2. The 24 relative high synonymous divergence also shows that the apparent high nucleotide similarity between SARS-CoV-2 and the bat strains (96.2% (Zhou, Yang, et al. 2020) and 97.2%(Zhou, 1 Chen, et al. 2020)) is caused by conservation at the amino acid level (d N / d S = 0.0410 and 2 0.0555) exacerbated by a high degree of synonymous homoplasy facilitated by a highly skewed 3 nucleotide composition at the third position of codons (with an AT content >72%, Table 1 ). 4 The synonymous divergence to the pangolin sequences GD410721 and GX_P1E in 5 genomic regions with inferred recombination tracts removed is 0.5095 (95% C.I., 0.4794-0.5396) 6 and 1.0304 (95% C.I., 0.9669-1.0939), respectively. Values for other comparisons are shown in 7 Tables 2 and 3. In comparisons between SARS-CoV-2 and more distantly related strains, d S will 8 be larger than 1, and with this level of saturation, estimation of divergence is associated with 9 high variance and may be highly dependent on the accuracy of the model assumptions. This 10 makes phylogenetic analyses based on synonymous mutations unreliable when applied to 11 these more divergent sequences. Nonetheless, the synonymous divergence levels seem Differences between estimates larger than 2.0 should not be interpreted strongly, as these 24 estimates have high variance and likely will be quite sensitive to the specifics of the model 25 assumptions. We find that d S (SARS-CoV-2, GD410721) approximately equals d S (GD410721, RaTG13) 1 and is larger than d S (SARS-CoV-2, RaTG13) in almost the entire genome showing than in these 2 parts of the genome GD410721 is a proper outgroup to (SARS-CoV-2, RaTG13) assuming a 3 constant molecular clock. One noticeable exception from this is the RBD region of the S gene. 4 In this region the divergence between SARS-CoV-2 and GD410721 is substantially lower than 5 between GD410721 and RaTG13 (Fig. 4a,4c) . The same region also has much smaller 6 divergence between SARS-CoV-2 and GD410721 than between SARS-CoV-2 and RaTG13 7 (Fig. 4a,4c) . The pattern is quite different than that observed in the rest of the genome, most 8 easily seen by considering the ratio of d S (SARS-CoV-2, GD410721) to d S (SARS-CoV-2, 9 RaTG13) (Fig. 2b, 2d) . In fact, the estimates of d S (SARS-CoV-2, RaTG13) are saturated in this 10 region, even though they are substantially lower than 1 in the rest of the genome. This strongly 11 suggests a recombination event in the region and provides independent evidence of that 12 previously reported based on amino acid divergence (e.g.,(Zhang, et al. 2020)). 13 The combined evidences from synonymous divergence and the topological 14 recombination inference, provide strong support for the recombination hypothesis. However, 15 these analyses alone do not distinguish between recombination into RaTG13 from an unknown 16 source as previously hypothesized (Boni, et al. 2020 ) and recombination between SARS-CoV-2 17 and GD410721 as proposed as one possible explanation by Lam et al. . To 18 distinguish between these hypotheses we searched for sequences that might be more closely 19 related, in the RBD region, to RaTG13 than SARS-CoV-2 and we plotted sliding window 20 similarities across the genome for RaTG13 (Fig. 1c) . We observe relatively low sequence 21 identity between RaTG13 and all three other strains in the ACE2 contact residue region of the 22 spike protein, which is more consistent with the hypothesis of recombination into RaTG13, as 23 proposed in (Boni, et al. 2020 ). Moreover, our BLAST search analyses of RaTG13 in this region 24 show highest local sequence similarity with GX pangolin virus strains which is the genome-wide with the hypothesis of recombination from a virus related to GX pangolin strains, than with 1 recombination between SARS-CoV-2 and GD410721. 2 Unfortunately, because of the high level of synonymous divergence to the nearest 3 outgroup, tree estimation in small windows is extremely labile in this region. In fact, synonymous 4 divergence appears fully saturated in the comparison with GX_P1E, eliminating the possibility to 5 infer meaningful trees based on synonymous divergence. However, we can use the overall 6 maximum likelihood tree using both synonymous and nonsynonymous mutations (Fig. 2d ). The 7 ML tree using sequence from the ACE2 contact residue region supports the clustering of SARS-8 CoV-2 and GD410721, but with unusual long external branches for all strains except SARS-9 CoV-2, possibly reflecting smaller recombination regions within the ACE2 contact residue region. 10 11 The use of synonymous mutations provides an opportunity to calibrate the molecular clock 13 without relying on amino acid changing mutations that are more likely to be affected by selection. 14 The rate of substitution of weakly and slightly deleterious mutations is highly dependent on 15 ecological factors and the effective population size. Weakly deleterious mutations are more 16 likely to be observed over small time scales than over long time scales, as they are unlikely to 17 persist in the population for a long time and go to fixation. This will lead to a decreasing dN/dS 18 ratio for longer evolutionary lineages. Furthermore, changes in effective population size will 19 translate into changes in the rate of substitution of slightly deleterious mutations. Finally, 20 changes in ecology (such as host shifts, host immune changes, changes in cell surface receptor, 21 etc.) can lead to changes in the rate of amino acid substitution. For all of these reasons, the use 22 of synonymous mutations, which are less likely to be the subject of selection than 23 nonsynonymous mutations, are preferred in molecular clock calculations. For many viruses, the 24 use of synonymous mutations to calibrate divergence times is not possible, as synonymous 25 sites are fully saturated even at short divergence times. However, for the comparisons between SARS-CoV-2 and RaTG13, and SARS-CoV-2 and RmYN02, synonymous sites are not 1 saturated and can be used for calibration. We find an estimate of ω = 0.0391 between SARS-2 CoV-2 and RaTG13, excluding just the small RDB region showing a recombination signal in 3 SARS-CoV-2 (Supplementary Table 5 providing an approximate 95% confidence interval of (0.0332, 0.0464). Also, using 59 human 7 strains of SARS-CoV-2 from Genbank and National Microbiology Data Center (See Methods) 8 we obtain an estimate of ω = 0.5604 using the F3x4 model in codeml. Notice that there is a 14-9 fold difference in d N /d S ratio between these estimates. Assuming very little of this difference is 10 caused by positive selection, this suggests that the vast majority of mutations currently 11 segregating in the SARS-CoV-2 are slightly or weakly deleterious for the virus. 12 13 To calibrate the clock we use the estimate provided by (http://virological.org/t/phylodynamic-15 analysis-of-sars-cov-2-update-2020-03-06/420) of =1.04×10 -3 substitutions/site/year (95% CI: 16 0.71x10-3, 1.40x10-3). The synonymous specific mutation rate can be found from this as 17 d S /year = S = /(pS +ωpN), where ω is the d N /d S ratio, and pN and pS are the proportions of 18 nonsynonymous and synonymous sites, respectively. The estimate of the total divergence on 19 the two lineages is then ^= ( + )⁄ . Inserting the numbers from Table 3 for the 20 divergence between SARS-CoV-2 and RaTG13 and RmYN02 ,respectively, we find a total 21 divergence of 96.92 years and 74.05 years respectively. Taking into account that RaTG13 was 22 isolated July 2013, we find an estimated tMRCA between that strain and SARS-CoV-2 of 23 times. The estimate for SARS-CoV-2 and RaTG13 is compatible with the values obtained using 1 different methods for dating (Boni, et al. 2020) . The variance in the estimate in d S is small and 2 the uncertainty is mostly dominated by the uncertainty in the estimate of the mutation rate. We 3 estimate the S.D. in ^ using 1000 parametric simulations, using the ML estimates of all 4 parameters, for both RaTG13 vs. SARS-CoV-2 and for RmYN02 vs. SARS-CoV-2, and for each 5 simulated data also simulating values of and from normal distributions with mean 1.04×10 -3 6 and S.D. 0.18×10 -3 , and mean 0.5604 and S.D. 0.1122, respectively. We subject each 7 simulated data set to the same inference procedure as done on the real data. Our estimate of 8 the S.D. in the estimate is 11.8 for RaTG13 vs. SARS-CoV-2 and 9.41 for RmYN02 vs. SARS-9 CoV-2, providing an approximate 95% confidence interval of (28.11, 75.31) and (18.19, 55 .85), 10 respectively. For RaTG13, if including all sites, except the 244-bp in the RBD of the S gene 11 (Supplementary Table 5 ), the estimate is 55.02 years with an approx. 95% C.I. of (29.4, 80.7). 12 As more SARS-CoV-2 sequences are being obtained, providing more precise estimates of the 13 mutation rate, this confidence interval will become narrower. However, we warn that the 14 estimate is based on a molecular clock assumption and that violations of this assumption 15 eventually will become a more likely source of error than the statistical uncertainty quantified in 16 the calculation of the confidence intervals. We also note that, so far, we have assumed no 17 variation in the mutation rate among synonymous sites. However, just from the analysis of the 18 300 bp windows, it is clear that is not true. The variance in the estimate of dS among 300 bp 19 windows from the RaTG13-SARS-CoV-2 comparison is approximately 0.0113. In contrast, in 20 the simulated data assuming constant mutation rate, the variance is approximately 0.0034, 21 suggesting substantial variation in the synonymous mutation rate along the length of the 22 genome. Alternatively, this might be explained by undetected recombination in the evolutionary 23 history since the divergence of the strains. 24 The highly skewed distribution of nucleotide frequencies in synonymous sites in SARS-CoV-2 1 (Kandeel, et al. 2020), along with high divergence, complicates the estimation of synonymous 2 divergence in SARS-CoV-2 and related viruses. In particular, in the third codon position the 3 nucleotide frequency of T is 43.5% while it is just 15.7% for C. This resulting codon usage is not 4 optimized for mammalian cells (e.g, (Chamary, et al. 2006) ). A possible explanation is a strong 5 mutational bias caused by Apolipoprotein B mRNA-editing enzymes (APOBECs) which can 6 cause Cytosine-to-Uracil changes (Giorgio, et al. 2020). 7 A consequence of the skewed nucleotide frequencies is a high degree of homoplasy in 8 synonymous sites that challenges estimates of d S . We here evaluated estimators of d S in 300 bp 9 sliding windows and found that a bias-corrected version of the maximum likelihood estimator 10 tended to perform best for values of d S < 2. We used this estimator to investigate the 11 relationship between SARS-CoV-2 and related viruses in sliding windows. We show that 12 synonymous mutations show shorter divergence to pangolin viruses, than the otherwise most 13 closely related bat virus, RaTG13, in part of the receptor-binding domain of the spike protein. 14 This strongly suggests that the previously reported amino acid similarity between pangolin 15 viruses and SARS-CoV-2 is not due to convergent evolution, but more likely is due to 16 recombination. In the recombination analysis, we identified recombination from pangolin strains 17 into SARS-CoV-2, which provides further support for the recombination hypothesis. However, 18 we also find that the synonymous divergence between SARS-CoV-2 and pangolin viruses in this 19 region is relatively high, which is not consistent with a recent recombination between the two. It 20 instead suggests that the recombination was into RaTG13 from an unknown strain, rather than 21 between pangolin viruses and SARS-CoV-2, as proposed in (Boni, et al. 2020 ). The alternative 22 explanation of recombination into SARS-CoV-2 from the pangolin virus, would require the 23 additional assumption of a mutational hotspot to account for the high level of divergence in the 24 region between SARS-CoV-2 and the donor pangolin viral genome. To fully distinguish between 25 these hypotheses, additional strains would have to be discovered that either are candidates for introgression into RaTG13 or can break up the lineage in the phylogenetic tree between 1 pangolin viruses and RaTG13. 2 The fact that synonymous divergence to the outgroups, RaTG13 and RmYN02, is not 3 fully saturated, provides an opportunity for a number of different analyses. First, we can date the 4 time of the divergence between the bat viruses and SARS-CoV-2 using synonymous mutations 5 alone. In doing so, we find estimates of 51.71 years (95% C.I., 28.11-75.31) and 37.02 years 6 (95% C.I., 18.19-55.85), respectively. Most of the uncertainty in these estimates comes from 7 uncertainty in the estimate of the mutation rate reported for SARS-CoV-2. As more data is being 8 produced for SARS-CoV-2, the estimate should become more precise and the confidence 9 interval significantly narrowed. We note that the mutation rate we use here are estimated based 10 on the entire genome, which may differ from that in non-recombination regions. To address this 11 problem, we downloaded all the SARS-CoV-2 sequences that are available until 2020-08- 17 12 from GISAID, and obtained an estimate of 1:0.81 for the ratio of mutation rates in the 13 recombination and non-recombination regions, using the "GTRGAMMA" model implemented in 14 the RAxML (Stamatakis 2014). Given the length ratio between the two partitions is 1:4, the 15 difference between the partitions will cause a slight overestimate of the mutation rate by ~5%, 16 which is relatively small compared to the confidence intervals and the potential for other 17 unknown sources of uncertainty. However, we warn that a residual cause of unmodeled 18 statistical uncertainty is deviations from the molecular clock. Variation in the molecular clock 19 could be modeled statistically (see e.g., (Drummond, et al. 2006 ) and (Lartillot, et al. 2016) ), but 20 the fact that synonymous mutations are mostly saturated for more divergent viruses that would 21 be needed to train such models, is a challenge to such efforts. On the positive side, we note that 22 the estimates of d S given in Table 3 very different approach based on including more divergent sequences and applying a relaxed 25 molecular clock. We see the two approaches as being complimentary. In the traditional relaxed molecular clock approach more divergent sequences are needed that may introduce more 1 uncertainty due to various idiosyncrasies such as alignment errors. Furthermore, the relaxed 2 molecular clock uses both synonymous and non-synonymous mutations and is, therefore, more 3 susceptible to the effects of selection. Our approach allows us to focus on just the relevant in-4 group species and to use only synonymous mutations. The disadvantage is that we cannot 5 accommodate a relaxed molecular clock. However, the fact that both approaches provide 6 similar estimates is reassuring and suggests that neither idiosyncrasies of divergent sequences, 7 natural selection, or deviations from a molecular clock has led to grossly misleading conclusions 8 Another advantage of estimation of synonymous and nonsynonymous rates in the 9 outgroup lineage, is that it can provide estimates of the mutational load of the current pandemic. 10 The d N /d S ratio is almost 14 times larger in the circulating SARS-CoV-2 strains than in the 11 outgroup lineage. While some of this difference could possibly be explained by positive 12 selection acting at a higher rate after zoonotic transfer, it is perhaps more likely that a 13 substantial proportion of segregating nonsynonymous mutations are deleterious, suggesting a 14 very high and increasing mutation load in circulating SARS-CoV-2 strains. 15 16 We are grateful to Dr. E.C Holmes for providing the genome sequence of RmYN02. We thank 18 Dr. We also thank 2. replicates. The associated distance matrix for (b) and (c) can be found in Table 3 . Basic local alignment search tool Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 An exact nonparametric method for inferring mosaic 10 structure in sequence triplets Hearing silence: non-neutral evolution at 12 synonymous sites in mammals Relaxed phylogenetics and dating with 14 confidence PHYLIP (Phylogeny Inference Package) version 3.7a. Distributed by the 16 author MAFFT multiple sequence alignment software version 7: 4 improvements in performance and usability Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins A mixed relaxed clock model Emergence of SARS-CoV-2 through recombination and strong purifying 14 selection Phylogeny-aware alignment with PRANK Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus 18 origins and receptor binding RDP: detection of recombination amongst aligned sequences Evaluation of methods for detecting recombination from DNA 3 sequences: computer simulations Identification of breakpoints in 5 intergenotypic recombinants of HIV type 1 by bootscanning Analyzing the mosaic structure of genes Detection of novel coronaviruses in bats in Myanmar Coronavirus from Wuhan: an Analysis Based on Decade-Long Structural Studies of SARS 13 A new coronavirus associated with human respiratory disease in China Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins PAML 4: phylogenetic analysis by maximum likelihood The mean of d S estimates using different methods; ML.corr and yn00.corr are the bias corrected versions of the ML and yn00 methods, respectively. (b) Errors in d S estimates as measured using the ratio of square root of mean squared error (MSE) to true d S . All the estimates are based on 10,000 simulations. ML: maximum-likelihood estimates using the f3x4 model in codeml; ML.corr, maximumlikelihood estimates with bias correction; yn00, count-based estimates in (Yang and Nielsen 2000); yn00.corr, yn00 estimates with bias correction