key: cord-313247-55loucvc authors: Pipes, Lenore; Wang, Hongru; Huelsenbeck, John P.; Nielsen, Rasmus title: Assessing uncertainty in the rooting of the SARS-CoV-2 phylogeny date: 2020-10-07 journal: bioRxiv DOI: 10.1101/2020.06.19.160630 sha: doc_id: 313247 cord_uid: 55loucvc The rooting of the SARS-CoV-2 phylogeny is important for understanding the origin and early spread of the virus. Previously published phylogenies have used different rootings that do not always provide consistent results. We investigate several different strategies for rooting the SARS-CoV-2 tree and provide measures of statistical uncertainty for all methods. We show that methods based on the molecular clock tend to place the root in the B clade, while methods based on outgroup rooting tend to place the root in the A clade. The results from the two approaches are statistically incompatible, possibly as a consequence of deviations from a molecular clock or excess back-mutations. We also show that none of the methods provide strong statistical support for the placement of the root in any particular edge of the tree. Our results suggest that inferences on the origin and early spread of SARS-CoV-2 based on rooted trees should be interpreted with caution. SARS-CoV-2, the virus causing COVID-19 or 'Severe Acute Respiratory Syndrome,' has a single-stranded RNA genome 29,891 nucleotides in length Zhou et al., 2020b) . The exact origin of the virus causing the human pandemic is unknown, but two coranoaviruses isolated from bats -RaTG13 isolated from Rhinolophus affinis (Zhou et al., 2020a) and RmYN02 isolated from Rhinolophus malayanus (Zhou et al., 2020b) , both from the Yunnan province of China -appear to be closely related. After accounting for recombination, the divergence time between these bat viruses and SARS-CoV-2 is estimated to be approximately 52 years [95% C.I. (28, 75) ] and 37 years [95% C.I. (18,56) ] , for RaTG13 and RmYN02 respectively, using a strict clock, only the most closely related sequences, and only synonymous mutations, or 51 years [95% HPD credible interval (40, 70)] for RaTG13 (Boni et al., 2020) Tang et al., 2020; Yu et al., 2020; Zhang et al., 2020) , analyses that used midpoint rooting reached another placement (Li et al., 2020c, d; Nie et al., 2020) , and yet other analyses using a molecular clock have reached a different placement of the root Giovanetti et al., 2020; Lemey et al., 2020; Li et al., 2020a) . In particular, there is considerable discrepancy between rootings based on rooting with the two closest outgroup sequences (Fig 1A) , which has a rooting in clade A, and rooting based on a molecular clock (Fig 1B) , which has a rooting in clade B, using clade designations by Rambaut et al. (2020) . Clade B contains the earliest sequences from Wuhan, and a rooting in this clade would be compatible with the epidemiological evidence of an origin of SARS-CoV-2 in or near Wuhan. However, if an outgroup rooting is assumed ( Fig 1A) the inferred origin is in Clade A which consists of many individuals from both inside and outside East Asia. Such a rooting would be compatible with origins of SARS-CoV-2 outside of Wuhan. The rooting of the SARS-CoV-2 pandemic is, therefore, critical for our understanding of the origin and early spread of the virus. However, it is not clear how best to root the tree and how much confidence can be placed in any particular rooting of the tree. There are many different methods for inferring the root of a phylogenetic tree, but they largely depend on three possible sources of information: outgroups, the molecular clock, and non-reversibility. The latter source of information can be used if the underlying mutational process is non-reversible, that is, for some pair of nucleotides (i,j), the number of mutations from i to j differs from the number of mutations from j to i, in expectation at stationarity. However, this source of information is rarely used to root trees because it relies on strong assumptions regarding the mutational process, and it has been shown to perform poorly on real data (Huelsenbeck et al., 2002) . Most studies use methods based on either outgroup rooting, molecular clock rooting, or a combination of both. Outgroup rooting is perhaps the conceptually easiest method to understand, and arguably the most commonly used method. In outgroup rooting, the position in which one or more outgroups connects to the ingroup tree is the root position. Outgroup rooting can be challenged by long-branch attraction if distant outgroups are being used (e.g. Felsenstein, 1978; Graham et al., 2002; Hendy and Penny, 1989; Maddison et al., 1984) . In such cases, the outgroup will have a tendency to be placed on the longest branches of the ingroup tree. In viruses, in particular, because of their high mutation rate, it can be challenging to identify an outgroup sequence that is sufficiently closely related to the ingroup sequences to allow reliable rooting. An alternative to outgroup rooting is molecular clock rooting, which is based on the assumption that mutations occur at an approximately constant rate, or at a rate that can be modeled and predicted using statistical models (e.g., using a relaxed molecular clock such as Drummond et al. (2006) ; Yoder and Yang (2000) ). The rooting is then preferred that makes the data most compatible with the clock assumption by some criterion. Early methods for rooting using molecular clocks were often labeled midpoint rooting as some original methods were based on placing the root halfway between the most distant leaf nodes in the tree (e.g. Swofford et al., 1996) . More modern methods use more of the phylogenetic information, for example, by finding the rooting that minimizes the variance among leaf nodes in their distance to the root (e.g. Mai et al., 2017) or produces the best linear regression of root-to-tip distances against sampling times when analyzing heterochronous data (Rambaut et al., 2016) . Methods for inferring phylogenetic trees that assume an ultrametric tree (i.e. a tree that perfectly follows a molecular clock), such as unweighted pair group method with arithmetic mean (UPGMA; Sokal and Michener, 1958) , directly infers a rooted tree. Similarly, Bayesian phylogenetic methods using birth-death process priors (Kendall, 1948; Thompson, 1975) or coalescent priors (Kingman, 1982a, b, c) also implicitly infers the root. But even with uninformative priors on the tree the placement of the root can be estimated in Bayesian phylogenetics using molecular clock assumptions. An advantage of such methods, over methods that first infer the branch lengths of the tree and then identify the root most compatible with a molecular clock, is that they explicitly incorporate uncertainty in the branch length estimation when identifying the root and they simultaneously provide measures of statistical uncertainty in the rooting of the tree. Boni et al., 2020; Wang et al., 2020) . Recombination in the outgroups is at odds with the assumption of a single phylogenetic tree shared by all sites assumed by phylogenetic models when using outgroup rooting, particularly if more than one outgroup is included in the analysis. To investigate the possible rootings of the SARS-CoV-2 phylogeny we used six different methods and quantified the uncertainty in the placement of the root for each method on the inferred maximum likelihood topology. We note that the question of placement of a root, is a question idiosyncratic to a specific phylogeny, and to define this question we fixed the tree topology, with the exception of the root placement, in all analyses. In all cases, we applied the method to the alignment of 132 SARS-CoV-2 sequences and two putative outgroup sequences, RaTG13 and RmYN02, (see Table S1 ) that was constrained such that the proteincoding portions of the SARS-CoV-2 genome were in frame, and is described in detail in Wang et al. (2020) . To ensure that we could accurately capture the rooting from available sequences, the sequences used for the analysis are chosen to be representative of the basal branches of the phylogeny and/or were early sequenced strains. There are two orders of magnitude more strains available in public databases, however these sequences are more terminally located and would provide little additional information about the placement of the root but have the potential to add a significant amount of additional noise. We are therefore focusing our efforts on the limited data set of early sequences. However, we note that future inclusion of more sequences with a basal The topology of the tree is shown in Figure 2 . The outgroup sequences were pruned from the tree using nw prune from Newick utilities v1.6 (Junier and Zdobnov, 2010) . Bootstrapping was preformed using the RAxML-NG --bootstrap option. For the RaTG13+RmYN02 analysis, only bootstrapped trees that formed a monophyletic group for RaTG13 and RmYN02 were kept. The clades of the tree were assigned according to nomenclature proposed by Rambaut et al. (2020) where the A and B clades are defined by the mutations 8782 and 28144 and based on whether or not they share those sites with RaTG13. The six different methods for identifying the root of the SARS-CoV-2 phylogeny were: (1) Outgroup rooting using RaTG13. We constrained the tree topology to be equal to the unrooted SARS-CoV-2 phylogeny, i.e. the only topological parameter estimated was the placement of the RaTG13 sequence on the unrooted SARS-CoV-2 phylogeny. We masked the potential recombination segment (NC 045512v2 positions 22851 to 23094) in RaTG13 identified in Wang et al. (2020) from the alignment. To quantify uncertainty we obtained 1,000 bootstrap samples. We note that while interpretation of bootstrap proportions in phylogenetics can be problematic (see Efron et al., 1996) we performed 1,000 parametric simulations using pyvolve (Spielman and Wilke, 2015) using maximum likelihood estimates, from the original data set, of the model of molecular evolution and the phylogenetic tree, including branch lengths (see Table S2 ). For each simulation, we then estimated the tree using the same procedure as used for the real data for both the simulated data, and for 100 bootstrap replicates. We then constructed confidence sets by (2) Outgroup rooting using RmYN02. We used the same methods as in (1) Rambaut (2000 Rambaut ( , 2009 applied to the maximum likelihood tree. This method uses the molecular clock to root the tree. We again quantified uncertainty using 1,000 bootstrap samples. To investigate differences in signs of temporal signal for the outgroup rooting and the molecular clock rooting, we calculated root-to-tip distances using TempEst v1.5.3 (Rambaut et al., 2016) for the ML tree using the outgroup rooting ( Fig S2) and a re-rooting of the ML tree using the molecular clock rooting (Fig S3) . Re-rooting was performed using nw reroot from Newick utilities v1.6 (Junier and Zdobnov, We estimated the maximum clade credibility tree using a time-measured Bayesian phylogenetic reconstruction implemented in BEAST (Suchard et al., 2018) v1.10.4. We used a GTR+Γ substitution model and the uncorrelated relaxed clock with a lognormal distribution, and specified flexible skygrid coalescent tree priors. TreeAnnotator was used to annotate the maximum clade credibility tree. /T h a il a n d /6 1 /2 0 2 0 S A R S -C o V -2 /h u m a n /C H N /W u h a n Y B 0 1 2 5 0 6 /2 0 2 0 U S A /W I1 /2 0 2 0 S A R S -C o V -2 /h u m a n /G u a n g z h o u /I Q T C 0 5 /2 0 2 0 S A R S -C o V -2 /I Q T C 0 3 /h u m a n /2 0 2 0 /C H N S A R S -C o V -2 /I Q T C 0 2 /h u m a n /2 0 2 0 /C H N S A R S -C o V -2 /h u m a n /C H N /B e ij in g IM E -B J 0 5 /2 0 2 0 S A R S -C o V -2 /h u m a n /C H N /Y N -0 3 0 6 -4 6 6 /2 0 2 0 S A R S -C oV -2 /h um an / 2. : The unrooted maximum likelihood topology for the SARS-CoV-2 phylogeny of 132 genomes with probabilities for six rooting methods. Only branches with probability greater than 0.05 for at least one method are shown. Branch lengths are not to scale. The software package pangolin (https://github.com/hCoV-2019/pangolin, updated on 2020-05-01) was used for lineage assignment based on lineages updated on 2020-04-27. After running the software for assignment, only lineages called for sequences where both aLRT and UFbootstrap values which quantify the branch support in phylogeny construction are >80. The global spread of 2019-ncov: a molecular evolutionary analysis. Pathogens and Global Health Relaxed phylogenetics and dating with confidence Bayesian Evaluation of Temporal Signal in Measurably Evolving Populations Temporal signal and the phylodynamic threshold of sars-cov-2. bioRxiv Bootstrap confidence levels for phylogenetic trees Cases in which parsimony or compatibility methods will be positively misleading The ucsc sars-cov-2 genome browser The first two cases of 2019-ncov in italy: Where they come from Mapping genome variation of sars-cov-2 worldwide highlights the impact of covid-19 super-spreaders Rooting phylogenetic trees with distant outgroups: a case study from the commelinoid monocots A framework for the quantitative study of evolutionary trees Inferring the root of a phylogenetic tree The newick utilities: high-throughput phylogenetic tree processing in the unix shell On the generalized "birth-death" process The coalescent. Stochastic Processes and their Applications Exchangeability and the evolution of large populations Outgroup analysis and parsimony Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction Phylogenetic and phylodynamic analyses of sars-cov-2 ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R Recombination and lineage-specific mutations led to the emergence of sars-cov-2. bioRxiv Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies Path-o-gen: temporal signal investigation tool Exploring the temporal structure of heterochronous sequences using tempest (formerly patho-gen) A dynamic nomenclature proposal for sars-cov-2 to assist genomic epidemiology usa sars-cov-2 isolates reveals haplotype signatures and localized transmission patterns by state and by country. medRxiv A statistical method for evaluating systematic relationships Pyvolve: a flexible python module for simulating sequences along phylogenies Bayesian phylogenetic and phylodynamic data integration using beast 1.10. Virus evolution Molecular systematics, chapter phylogenetic inference On the origin and continuing evolution of SARS-CoV-2 Human Evolutionary Trees Emergence of genomic diversity and recurrent mutations in sars-cov-2. Infection Synonymous mutations and the molecular evolution of sars-cov-2 origins World Health Organization 2020 A new coronavirus associated with human respiratory disease in china Estimation of primate speciation dates using local molecular clocks Decoding the evolution and transmissions of the novel pneumonia coronavirus (sars-cov-2/hcov-19) using whole genomic data Origin and evolution of the 2019 novel coronavirus A novel bat coronavirus reveals natural insertions at the s1/s2 cleavage site of the spike protein and a possible recombinant origin of hcov-19 A pneumonia outbreak associated with a new coronavirus of probable bat origin We thank Dr. Adi Stern for discussion. The research was funded by Koret-UC Berkeley- Biology and Bioinformatics to RN.