key: cord-103341-q7yqnvm2 authors: MacPherson, Ailene; Louca, Stilianos; McLaughlin, Angela; Joy, Jeffrey B.; Pennell, Matthew W. title: A General Birth-Death-Sampling Model for Epidemiology and Macroevolution date: 2020-10-11 journal: bioRxiv DOI: 10.1101/2020.10.10.334383 sha: doc_id: 103341 cord_uid: q7yqnvm2 Birth-death stochastic processes are the foundation of many phylogenetic models and are widely used to make inferences about epidemiological and macroevolutionary dynamics. There are a large number of birth-death model variants that have been developed; these impose different assumptions about the temporal dynamics of the parameters and about the sampling process. As each of these variants was individually derived, it has been difficult to understand the relationships between them as well as their precise biological and mathematical assumptions. And without a common mathematical foundation, deriving new models is non-trivial. Here we unify these models into a single framework, prove that many previously developed epidemiological and macroevolutionary models are all special cases of a more general model, and illustrate the connections between these variants. To do so, we develop a novel technique for deriving likelihood functions for arbitrarily complex birth-death(-sampling) models that will allow researchers to explore a wider array of scenarios than was previously possible. As an illustration of the utility of our mathematical approach, we use our approach to derive a yet unstudied variant of the birth-death process in which the key rates emerge deterministically from a classic susceptible infected recovered (SIR) epidemiological model. whether the rates are interpreted as describing pathogen transmission or macroevolutionary diversification. 75 In the model, transmission/speciation results in the birth of a lineage and occurs at rate ⁄(· ), where · 76 (0 AE · AE t or ) is measured in units of time before the present day, such that ⁄ can be time-dependent. We 77 make the common assumption that lineages in the viral phylogeny coalesce exactly at transmission events, 78 thus ignoring the pre-transmission interval inferred in a joint phylogeny of within-and between-host [32] . 79 Throughout, we will use · as a a general time variable and t ◊ to denote a specific time of an event ◊ time 80 units before the present day (see Table S1 ). Lineage extinction, resulting from host recovery or death in the 81 epidemiological case or the death of all individuals in a population in the macroevolutionary case, occurs at 82 time-dependent rate µ(· ). We allow for two distinct types of sampling: lineages are either sampled according 83 to a Poisson process through time Â(· ) or binomially at very short intervals, which we term "concerted sam-84 pling attempts" (CSAs), where lineages at some specified time t l are sampled with probability fl l (f denotes 85 a vector of concerted sampling events at di erent time points). In macroevolutionary studies based only on 86 extant lineages, there is no Poissonian sampling, but a CSA at the present (i.e., fl 0 > 0). In epidemiology, 87 CSAs correspond to large-scale testing e orts (relative to the background rate of testing) in a short amount 88 of time (relative to the rates of viral sequence divergence); for full explanation, see Supplementary Material 89 section S1.2.3. We call these attempts rather than events because if fl is small or the infection is rare in the 90 population, few or no samples may be obtained. CSAs can also be incorporated into the model by including 91 infinitesimally short spikes in the sampling rate  (more precisely, appropriately scaled Dirac distributions). 92 Hence, for simplicity, in the main text we focus on the seemingly simpler case of pure Poissonian sampling 93 through time except at present-day, where we allow for a CSA to facilitate comparisons with macroevolu-94 tionary models; the resulting formulas can then be used to derive a likelihood formula for the case where 95 past CSAs are included (Supplement S1.2.3). 96 In the epidemiological case, sampling may be concurrent (or not) with host treatment or behavioural 97 changes resulting in the e ective extinction of the viral lineage. Hence, we assume that sampling results 98 in the immediate extinction of the lineage with probability r(· ). Similarly, in the case of past CSAs we must 99 include the probability, r l , that sampled hosts are removed from the infectious pool during the CSA at time 100 ary case to explicitly model the collection of samples from the fossil record (e.g., the fossilized birth-death 102 process [19]). For our derivation, we make no assumption about the temporal dynamics of ⁄, µ, Â, or r; each may be 104 constant, or vary according to any arbitrary function of time given that it is biologically valid (i.e., non-105 negative and between 0 and 1 in the case of r). We make the standard assumption that at any given time any 106 given lineage experiences a birth, death or sampling event independently of (and with the same probabilities 107 as) all other lineages. We revisit this assumption in Box 1 where we discus how the implicit assumptions of 108 the BDS process are well summarized by the diversification model's relationship to the SIR epidemiological 109 model. Our resulting general time-variable BDS process can be fully defined by the parameter set BDS = 110 {⁄(· ), µ(· ), Â(· ), r(· ),f}. In order to make inference about the model parameters, we need to calculate the likelihood, L, that an 112 observed phylogeny, T, is the result of a given BDS process. With respect to the BDS process there are 113 two ways to represent the information contained in the phylogeny T, both of which have been used in the 114 literature, which we call the "edge" and "critical time" representations, respectively. We begin by deriving 115 the likelihood in terms of the edge representation and later demonstrate how to reformulate the likelihood 116 in terms of critical times. In the edge representation, the phylogeny is summarized as a set of edges in the 117 mathematical graph that makes up the phylogeny, numbered 1-11 in Figure B1 , and the types of events that 118 occurred at each node. We define g e (· ) as the probability that the edge e which begins at time s e and ends at 119 time t e gives rise to the subsequently observed phylogeny between time ·, (s e < · < t e ) and the present day. The likelihood of the tree then, is by definition g stem (t or ): the probability density the stem lineage (stem = 1 121 in Figure B1 ) gives rise to the observed phylogeny from the origin, t or , to the present day. Although it is 122 initially most intuitive to derive the likelihood in terms of the edge representation, as we show below, it is 123 then straightforward to derive the critical times formulation which results in mathematical simplification. Deriving the Initial Value Problem (IVP) for g e (· ): We derive the IVP for the likelihood density g e (· ) 125 using an approach first developed by [11] . For a small time · the recursion equation for the change in the 126 likelihood density is given by the following expression. (1) Here, E(· ) is the probability that a lineage alive at time · leaves no sampled descendants at the present day. 129 We will examine this probability in more detail below. Assuming · is small, we can approximate the above 130 recursion equation as the following di erence equation. 131 g e (· ) ¥ ≠(⁄(· ) + µ(· ) + Â(· )) · g e (· ) + 2⁄(· )g e (· )E(· ) · + O( · 2 ). (2) By the definition of the derivative we have: 133 dg e (· ) d· = ≠(⁄(· ) + µ(· ) + Â(· ))g e (· ) + 2⁄(· )g e (· )E(· ). where the auxiliary function, , is given by: This function, (s, t), maps the value of g e at time s to its value at t, and hence is known as the probability 145 "flow" of the Kolmogorov backward equation [35] . Deriving the IVP for E(· ): We derive the IVP for E(· ) in a similar manner as above, beginning with a 147 di erence equation. 149 By the definition of a derivative we have: where fl 0 is the probability a lineage is sampled at the present day. The initial condition at time 0 is therefore 152 the probability that a lineage alive at the present day is not sampled. Given an analytical or numerical general 153 solution to E(· ), we can find the likelihood by evaluating g stem (t or ), as follows. Deriving the expression for g stem (t or ): Given the linear nature of the di erential equation for g e (· ) and 155 hence the representation in Equation (5)), the likelihood g stem (· ) is given by the product over all the initial 156 conditions times the product over the probability flow for each edge. Representing g stem (t or ) in terms of critical times: Equation (9) can be further simplified by removing the 159 need to enumerate over all the edges of the phylogeny (the last term of Equation (9)) and writing L in terms 173 the probability flow can be rewritten as the following ratio: This relationship allows us to rewrite the likelihood by expressing the product over the edges as two separate 181 Note (0) = 1. it is often appropriate to condition the tree likelihood on the tree exhibiting some property, for example the 184 condition there being at least sampled lineage. Imposing a condition on the likelihood is done by multiplying 185 by a factor S. Various conditioning schemes are considered in section S1.4 and listed in Table S3 . The 186 resulting likelihood of the general BDS model is: provides a consistent notation for unifying a multitude of seemingly disparate models, it also provides a con- Figure B1 ) are time independent. In this 258 case the deterministic compartmental model is given by the following initial value problem. Step 1: Model Specification -As discussed in box 1, SIR compartmental models can be used to constrain 266 BDS rates by setting ⁄(· ) = -S(t or ≠ · ) and µ(· ) = " + " + -, which under the present assumptions 267 is a constant. The sampling rate is also assumed constant Â(· ) = Â. We will assume that all samples are 268 acquired through Poissonian sampling at a constant rate, hence we include neither CSAs at in the past nor at 269 the present-day. Upon sampling we will assume that all lineages are removed with a constant probability r. 270 Finally, we will condition the likelihood on the observation of at least one sample since the time of the most 271 recent common ancestor. To avoid extending the inference to points very early in the spread of the infection 272 we will condition on observing at least one sampled lineage since the time of most recent common ancestor, 273 S 6 . Step 2: IVP for g e (· ) -The initial value problem for g e (· ) is straightforward to derive: Step 3: IVP for E(· ) -Similarly we have: Step 4: Expression for g stem (t or ). The expression for g stem (t or ) in the case of no present-day sampling can 279 be obtained from 9 and setting N 0 = 0. Step 5: g stem (t or ) in terms of the critical times -We can then transform the solution into the critical time Step 6: Conditioning the likelihood -Finally, we condition the likelihood on observing at least one lineage 285 given the TMRCA, Here we derive a phylogenetic birth-death-sampling model in a more general form than previously attempted, 294 making as few assumptions about the processes that generated the data as possible. While drawing inferences 295 from data will require making additional assumptions and applying mathematical constraints to the param- class (e.g., ⁄(· ) = -S(· )) and the viral death rate to the infectious recovery or removal rate µ(· ) = "+"+-, 353 whereas the sampling rate Â(· ) is identical across models. While constraining the birth, death, and sampling 354 rates in this manner can be used to parameterize compartmental models (e.g., [23]) doing so is an approxima-355 tion assuming independence between the exact timing of transmission, recovery or removal from population, 356 and sampling events in the SIR model and birth, death, and sampling events in the diversification model. The Unifying the epidemiological and evolutionary dynamics of pathogens Complex Population Dynamics and the Coalescent Under Neutrality On the Genealogy of Large Populations An Integrated Framework for the Inference of Viral 373 Population History From Reconstructed Genealogies Exploring the Demographic History of DNA Sequences Using the 375 Bayesian Coalescent Inference of Past 377 Population Dynamics from Molecular Sequences Phylodynamics of 380 Infectious Disease Epidemics How well can the exponential-growth coalescent approximate constant-rate 382 birth-death population dynamics? Inference of Epidemiological Dynamics Based on 384 Simulated Phylogenies Using Birth-Death and Coalescent Models On the Generalized "Birth-and-Death Estimating a Binary Character's E ect on Speciation and 389 Extinction On incomplete sampling under birth-death models and connections to the 391 sampling-based coalescent Sampling-through-time in birth-death trees Mathematical models of cladogenesis The reconstructed evolutionary process Phylogenetic approaches for studying diversification Simulating trees with millions of species The conditioned reconstructed process The fossilized birth-death process for coherent 403 calibration of divergence-time estimates Estimating the Basic Reproductive 406 Number from Viral Sequence Data Birth-death skyline plot reveals temporal 408 changes of epidemic spread in HIV and hepatitis C virus (HCV) Uncovering epidemiological dynamics in heterogeneous host 410 populations using phylogenetic methods Simultaneous reconstruction of 413 evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR 414 model Bayesian Inference of Sampled Ancestor 416 Trees for Epidemiology and Fossil Calibration Birth-Death Models in Macroevolution Reconciling molecular phylogenies with the fossil record Mammalian phylogeny reveals recent diversification rate shifts Getting to the root of epidemic spread with phylodynamic analysis of 425 genomic data The spread of hepatitis c virus genotype 1a in north america: a retrospective 427 phylogenetic study Speciation gradients and the distribution of biodiversity Extant timetrees are consistent with a myriad of diversification 431 histories Timing and order of transmission 433 events is not directly reflected in a pathogen phylogeny Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung Proceedings of the First Berkeley Symposium on Mathematical Statistics and A General and E cient Algorithm for the Likelihood of Diversification 440 and Discrete-Trait Evolutionary Models E cient comparative phylogenetics on large trees BEAST 2.5: An advanced software platform for Bayesian evolutionary 444 analysis Recursive algorithms for phylogenetic tree 446 counting Inferring epidemiological dynamics with 448 Bayesian coalescent inference: the merits of deterministic and stochastic models E cient comparative phylogenetics on large trees General models of multilocus evolution A general consumer-resource population model Phylodynamics with Migration: A 457 Computational Framework to Quantify Population Structure from Genomic Data A Multitype Birth-Death Model for Bayesian 460 Inference of Lineage-Specific Birth and Death Rates Origination and extinction through the phanerozoic: a new approach Improved estimation of macroevolutionary rates 464 from fossil data using a bayesian framework Explosive evoltuionary radiation: decreasing specaition or increasing 466 extinction through time? Modeling Infectious Diseases: In Humans and Animals Density-dependent diversification in North American wood warblers Prolonging the Past Counteracts the Pull of the Present: Protracted 473 Speciation Can Explain Observed Slowdowns in Diversification Detection of HIV transmission clusters from 476 phylogenetic trees using a multi-state birth-death model Age-Dependent Speciation Can Explain the Shape 479 of Empirical Phylogenies Inferring Epidemic Contact Structure from Phylogenetic Trees Estimating Epidemic Incidence and Prevalence from Genomic Data . Arising from a BDS process, this sampled tree can be summarized in two ways. First by the set of edges (labeled 1-11) or as a set of critical times (horizontal lines) including: 1) the time of birth events (solid, x i ) 2) terminal sampling times (dashed, y j ), and 3) ancestral sampling times (dotted, z k ). Given the inferred rates from a reconstructed sampled tree, these rates can be used to estimate characteristic parameters of the SIR model, for example the basic or e ective reproductive number.