key: cord-0945397-ezsvz4xp
authors: Wertheim, Joel O.; Steel, Mike; Sanderson, Michael J.
title: Accuracy in near-perfect virus phylogenies
date: 2021-05-06
journal: bioRxiv
DOI: 10.1101/2021.05.06.442951
sha: 21053e5e1962c2840db22cc6e3c6dff85f10d48c
doc_id: 945397
cord_uid: ezsvz4xp

Phylogenetic trees from real-world data often include short edges with very few substitutions per site, which can lead to partially resolved trees and poor accuracy. Theory indicates that the number of sites needed to accurately reconstruct a fully resolved tree grows at a rate proportional to the inverse square of the length of the shortest edge. However, when inferred trees are partially resolved due to short edges, “accuracy” should be defined as the rate of discovering false splits (clades on a rooted tree) relative to the actual number found. Thus, accuracy can be high even if short edges are common. Specifically, in a “near-perfect” parameter space in which trees are large, the tree length ξ (the sum of all edge lengths), is small, and rate variation is minimal, the expected false positive rate is less than ξ/3; the exact value depends on tree shape and sequence length. This expected false positive rate is far below the false negative rate for small ξ and often well below 5% even when some assumptions are relaxed. We show this result analytically for maximum parsimony and explore its extension to maximum likelihood using theory and simulations. For hypothesis testing, we show that measures of split “support” that rely on bootstrap resampling consistently imply weaker support than that implied by the false positive rates in near-perfect trees. The near-perfect parameter space closely fits several empirical studies of human virus diversification during outbreaks and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2, reflecting low substitution rates relative to high transmission/sampling rates in these viruses.

on T (by the Markovian process described above).

Given a tree T ∈ B(n), let d T (e 1 , e 2 ) denote the number of edges of T that lie 136 strictly within the path between e 1 and e 2 (i.e., excluding e 1 and e 2 ). Thus, e 1 and e 2 are 137 adjacent if and only if d T (e 1 , e 2 ) = 0. In addition, let ϕ T = (ϕ T (0), ϕ T (1), . . . , ϕ T (n − 3)), 138 where ϕ T (i) is the number of (unordered) pairs of edges {e, e } of T for which d T (e, e ) = i. 139 Finally, for i between 1 and n − 3, let 140φ

The probability of a false split is then given by the following theorem (see SI for 141 proof).

Theorem 1 For each T ∈ B(n), and k 1 we have:

Theorem 1 shows that for fixed k and n, the shape of T plays a significant role in 143 determining Φ (k) T ; in particular, unbalanced trees (such as caterpillars) will have a smaller 144 value of Φ (k) T than more balanced trees. Indeed, it is possible to calculate the value of Φ (k) T number of leaves), m, and λ (equivalently, FP T is a function of T , m, and ξ). 155 In general, it is mathematically complicated to describe F P T in terms of these 156 parameters. However, when the number of leaves in a tree grows faster than the number of 157 perfectly compatible characters, it is possible to state a limit result to provide an 158 approximation to F P T for large trees. 159 In the following theorem, we consider the following setting: 160 (I) mξ = Θ(n β ) for some 0 < β < 1 2 , and 161 (II) mξ 2 = O(1), 162 where O(1) refers to dependence on n (thus mξ 2 is not growing with n). Note that 163 Condition (I) implies that the number of perfectly evolved characters grows with the 164 number of leaves, but at a rate that is slower than linearly. Conditions (I) and (II) imply 165 that ξ decreases as n increases. 166 In this setting, we show that the false positive rate is (asymptotically) of the form ξ 3 167 times a function Ω that involves T (via its shape), m, and ξ. If we now treat ξ as a 168 variable, then for ξ = 0, the function Ω is close to 1 (for large n) and so F P T initially 169 grows like ξ/3. However, as ξ increases, Ω begins to decline at an increasing rate, resulting 170 in the false positive rate reaching a maximum value before starting to decrease.

To describe this result, we need to define this function Ω. Let ).

Notice that Ω(T n , ξ, m) depends on T n only via the coefficientsφ Tn (i), and this 174 dependence is linear. Thus, if D is a distribution on trees (e.g. the PDA or YH), then the expected value of Ω(T n , ξ, m) is given by:

For the PDA distribution, the term E P DA [φ Tn (i)] has an explicit exact value, 177 namely,

for all i between 1 and n − 3 (see SI for proof).

Theorem 2 For each n 1, let T n be a binary phylogenetic tree with n leaves, and 180 suppose that Conditions (I) and (II) hold.

(i)

· Ω(T n , ξ, m) · (1 + o(1)),

where o(1) is a term that tends to 0 as n grows.

(ii) If T n is sampled from a distribution D (e.g. PDA, YH), then the expected value of Remarks: Note that F P Tn depends only on the shape of the tree T n (and not on 183 how its leaves are labelled), thus for a tree distribution D on either the class of caterpillar 184 trees, or symmetric trees, we have F P Tn = E D [F P Tn ].

Notice also from Fig. 3 that as ξ increases from 0 the estimate of E D [F P Tn ] given by ξ 3 · Ω(T n , ξ, m) for the YH, PDA distributions and for symmetric trees initially increases (approximately linearly) with ξ but then begins to decrease with increasing ξ. By contrast, when T n has the caterpillar tree shape, the estimate of F P Tn appears to be constant as ξ increases from 0 (see Fig. 3 ). Indeed, when T n is a caterpillar tree, the expression for F P Tn in Theorem 2(i) reduces to the following remarkably simple expression as n becomes large:

F P Tn ∼ 4/(3m), and PDA random trees. Ten values of λ in the interval [10 −5 , 0.31622] were analyzed.

PAUP was used for MP bootstrapping (same heuristic search as above but with 100 212 replicates × 10 subreplicates); IQTree2 was used (50 random tree replicates) for SH-aLRT IQTree2 with a JC69 model, minimum branch lengths of 10 −9 , and collapsed polytomies.

Clade support was determined using ultrafast bootstrapping (10,000 replicates), SH-aLRT 221 (10,000 replicates), and aBayes. In addition, we also analyzed an iatrogenic HIV-1 outbreak in Cambodia (Rouet et al., 2018) and the first wave of the SARS-CoV-2 epidemic in China 237 (Pekar et al., 2021) . The SARS-CoV-2 phylogeny is the ML tree used in Pekar (Pekar et al., 238 2021) (see Data S1 for list of GISAID Accession IDs Simulations of tree inference with MP, over a large range of tree lengths, ξ, and 263 other parameters illustrate several known results (Fig. 2) and perhaps a few less well 264 known ones. First, resolution of the inferred tree increases with tree length. Second,

"overall" accuracy, as measured by the RF distance, is optimal at an intermediate tree Yang, 1998; Bininda-Emonds et al., 2001; Steel and Leuenberger, 2017) .

Moreover, when ξ >> ξ * , the false positive error rate, FP T , is similar to the false negative 268 rate, FN T , as might be expected because the true and estimated trees are nearly binary;

However, when ξ << ξ * , FP T << FN T , and the false positive error rate can remain 271 quite good (< 0.05) over a large range of ξ even when the false negative error rate is very 272 high. However, the range of tree lengths for which this result holds depends critically on 273 rate variation across edges and sites. When ξ 1, the false positive rate is low and 274 insensitive to the presence of rate variation; but, when ξ > 1, the false positive rate is 275 much more sensitive to rate variation-high when variation is present and low when absent 276 (contrast Fig. 2A and Fig. 2B ). In real-world data, as ξ increases, we expect that evidence 277 of rate variation will become more apparent. Key elements of these findings can be shown 278 analytically in a "near-perfect" zone described by a simple evolutionary model. 2 . Accuracy of maximum parsimony phylogeny reconstruction in simulations over a wide range of tree length, ξ, and other parameters. Open circles are mean false positive error rates; closed circles are mean false negative error rates (log scale left); dashed sigmoidal curve is fractional resolution of estimated tree (linear scale right). Trees are generated by a random proportional-to-distinguishable-arrangement (PDA) model for 513 taxa, from which a sequence alignment length of 1000 sites is generated. The dotted horizontal line is placed at an error rate of 0.05. Asterisk marks the location of the optimal tree length with best overall Robinson-Foulds accuracy, ξ * . Each point is mean of 1000 replicates × 100 sub-replicates (see Methods). "Near-perfect" values of ξ 1.0 are shaded. A) JC69 model with no edge length or across-site-rate variation. Because of y-axis log scaling, two y values of zero were set to 0.0001. B) JC69 model with substantial edge length and across-site-rate variation, both modeled as a gamma distribution with shape parameters αe = αASR = 0.25). distributed with mean ξ; and the probability of more than one change on an edge is low, 287 meaning multiple changes at a site occur on distinct edges. Though these conditions will 288 generate alignments dominated by "perfect" sites exhibiting no homoplasy, a few sites may 289 exhibit homoplasy even with ξ 1, which motivates the term "near-perfect". Under these 290 conditions, tree reconstruction methods will tend to infer relatively unresolved trees unless 291 the number of sites is very large.

Rare sites that exhibit homoplasy can introduce false positive splits on the inferred tree ( Fig. 1) . A naïve argument using Equation 1 might suggest that FP T would depend on

, namely the ratio of the expected numbers of sites having changes on two edges (i.e., those that are potentially homoplastic and misleading) to those sites having only a single change (those that are reliable), for sufficiently small ξ. But because only one-third of those two-edge sites are actually homoplastic in a JC69 model,

which implies FP T is small when ξ is small enough (e.g., FP T < 0.05 whenever ξ < 0.15).

This approximation can be improved further by recognizing that not all two-edge 294 homoplastic sites induce false positives, depending on their position in the true tree ( Fig.   295 1). Given the evolutionary model, the probability that k perfect sites, and another site f 296 that has evolved with two edge changes will produce a "false positive" under MP is

T (Theorem 1 above). Because this probability is often less than one, FP T can 298 remain below 0.05 at higher values of ξ than the naïve argument suggests.

If the true tree were known with some precision, the first part of Theorem 2 could 300 be used directly to calculate false positive rates. However, in the "near-perfect" parameter 301 space of large n and ξ 1, estimates of the true tree are likely to be only partially resolved 302 (Fig. 2) . We therefore derive the expected false positive rate for a distribution, D, of 303 randomly generated trees of size n, E D [FP Tn ], generated from parameters based on the 304 inferred tree. In the remainder of this paper, the "expected false positive rate" will 305 generally refer to E D [FP Tn ]. We assume that D is usually either a "proportional-to-distinguishable-arrangement" (PDA) or Yule-Harding (YH) distribution 307 (Aldous, 2001) , but also consider the two extreme cases of completely unbalanced 308 (caterpillar) trees, and completely balanced (symmetric) trees. Unlike PDA and YH trees, 309 these last two have a constant tree shape (with random leaf labels). From the second part 310 of Theorem 2, we see that, for a JC69 model and trees inferred with MP, the following 311 approximation holds increasingly well as n increases:

given the assumption that ξ is sufficiently small and the number of sites does not grow too 313 quickly with the size of the tree. The function Ω(T n , ξ, m), defined in Materials and

Methods, is monotonically decreasing in ξ and m, and depends on the shape of T .

Simulations indicate that the approximation is close for ξ 1 (Fig. 3 ), but if many equally 316 parsimonious trees are present, the search algorithm should take a strict consensus of a 317 broad sample of those solutions (Fig. S3 ). E D [FP Tn ] is better on average for PDA than YH 318 trees, and both are bounded between a theoretical worst case error rate for symmetric and 319 best case error rate for caterpillar trees. In fact, the expected false positive rate for the 320 latter is just 4/(3m) in the limit of large n, which is independent of ξ. but even when ASR variation is large (gamma shape parameter α ASR = 0.1), the false 332 positive rate remains slightly below 5% for simulated dataset sizes in the absence of EL 333 variation (Fig. S4 ).

Departure of the substitution model from the JC69 model assumed in the 335 "near-perfect" zone can also increase the expected false positive rate. For example, a 336 strong transition-transversion bias increases E D [FP Tn ] substantially, though it still remains 337 well below 5% under our typical simulation conditions when ξ 1 (Fig. S5 ).

Thus, the near-perfect tree length of ξ 1 is a region in which rate variation 339 appears to have less of an impact on false positive rates than when tree lengths are longer.

This suggests that the definition of near-perfect zone in practice can include substantial 341 rate variation as long as ξ 1. 

We estimated key parameters from the trees and underlying data for 11 empirical 344 virus phylogenies ( 

Epidemics with young crown group ages on the order of years or decades (e.g., Zika (Table 1) for maximum parsimony (MP) inference, estimated by simulation using parameters estimated from the data (Table S1 ). Abbreviations given in Table 1 . Simulation experiments assumed ASR variation according to either an invariant sites model (circles) or a gamma distributed model (triangles) and a Yule-Harding random tree distribution (each point is mean of 500 replicates × 100 sub-replicates). Model point with higher likelihood is shaded. False positive rates assuming PDA random trees are uniformly slightly better (Fig. S6 ). The near-perfect zone of ξ 1.0 is shaded. Horizontal dashed line indicates a 0.05 expected false positive rate. Nonetheless, some differences were observed, which tended to imply better accuracy for First, the computational expense of ML searches makes it tempting to undertake fewer 378 replicate searches for local optima, but this was as critical to improve the fit to Equation 6 379 for ML as it was for MP (Fig. S7) . Second, ML programs set hard numerical lower bounds 380 strictly greater than zero on edge lengths, often (by default) on the same order asλ, so 381 these must be reset downward to obtain correct tree likelihoods (Morel et al., 2020).

Finally, inferred edge lengths that are larger than these programs' lower bounds but still 383 smaller than about 1/m tend to be included in the ML tree despite weak evidence 384 (IQTree2 issues a warning about this). We saw this in ML searches roughly when ξ 0.1, 385 when three-state sites become more common in alignments than they were at lower values 386 of ξ. Even without homoplasy, ML tends to over-resolve trees in a way that elevates 387 E D [FP Tn ]. By collapsing short edge lengths inferred by ML to be less than 1/m, this 388 behavior can be mitigated (Fig. S7 ).

In general, ML is expected to be more accurate than MP under more realistic Notably, aBayes is the only one of the four metrics that is not based on resampling. 403 We explored other factors impacting support in the boundary case of perfect trees.

For sequence length, we computed standard support metrics in an ML framework in 405 perfect four-taxon datasets, in which each branch was defined by a single change, and 406 alignments range between 40 nt and 30,000 nt (Fig. S8 On the other hand, in perfect trees from 8-128 taxa, in which the mean edge length 417 remained the same (but therefore ξ grew with n), mean SH-aLRT and aBayes support was 418 unchanged, but mean ultrafast bootstrap support increased (Fig. S9 ).

In this paper, we study a "near-perfect" parameter space for phylogenetic inference 421 on large trees with small tree lengths and no rate variation within or between sites or 422 edges. The "near-perfect" tree length of ξ 1 means that few sites exhibit homoplasy and, 423 for MP inference, the false positive rate can be much better than the false negative rate 424 and well under 5% for typical datasets with thousands of sites. The near-perfect conditions 425 defined here to allow mathematical derivations appear to be sufficient but not necessary.

For example, with no rate variation, the false positive rate can be very good even when 427 ξ > 1 ( Fig. 2A, S5) , and, if ξ < 1, a substantial level of rate variation can be present without elevating the false positive rate by nearly as much as when ξ > 1 (Fig. 2,4 , S4).

The second case is clearly more relevant in real-world data. The 11 empirical virus 430 datasets all had substantial rate variation and showed a general increase in false positive 431 rate with ξ, with almost all rates below 5% occurring when ξ 1, much like the predicted 432 patterns seen in Fig Moreover, m also grows as O(1/λ 2 min ) for ML and some more ad hoc estimators (Erdös 464 et al., 1999; Roch, 2019) , implying again that short edges tend to degrade accuracy when 465 accuracy is defined in terms of total agreement between T andT , unlike here. 466 A cryptic factor affecting the false positive rate is tree shape. Highly asymmetric 467 trees have better expected false positive rates than highly symmetric trees, because expected path lengths are longer and it is harder to induce false positive splits by chance 469 (Fig. 1) . Thus, a random sample of PDA trees will have a better E D [FP Tn ] than more 470 symmetrical YH trees. Differences in tree shape among RNA virus phylogenies have long 471 been noted (Grenfell et al., 2004) , such as the typically more asymmetric influenza trees.

Perfect and near-perfect phylogenies have been studied as discrete optimization 473 problems (Gusfield, 1997; Fernandez-Baca and Lagergren, 2003) in which the goal is to Model-based phylogenetic inference methods such as ML and Bayesian inference are 483 generally regarded as theoretically superior to MP, especially for datasets that fit 484 substitution models much more complex than our "near-perfect" JC69 model with no rate 485 variation. Though our mathematical results for expected false positive rates were derived 486 for MP, there is both relevant theory and considerable simulation evidence to suggest that 487 in the near-perfect zone, the ML expected false positive rate is approximated by the MP 488 theory, both in terms of its absolute value and its shape as a function of tree length. As ξ 489 increases, especially above ξ * , ML consistently has better accuracy than MP, but we 490 conjecture that the false positive rates of MP and ML differ much less as ξ gets very small.

Further work is needed to test this conjecture.

The connection between the false positive rate as a measure of accuracy and 493 conventional measures of phylogenetic support appears to be sensitive to the choice of 494 support method when ξ << 1 (Fig. 6) , 1993; Felsenstein and Kishino, 1993; Efron et al., 1996; Susko, 2008 Susko, , 2009 Alfaro and 499 Holder, 2006; Simmons and Norton, 2014) , but remains somewhat fraught. We hesitate to 500 draw firm conclusions without a formal analysis of support in the "near-perfect" parameter 501 space but note the variability in support estimates we found (Fig. 6 ).

The low false positive rate in near-perfect trees suggests that phylogenies describing We gratefully acknowledge the authors from the originating laboratories and the 518 submitting laboratories, who generated and shared via GISAID the SARS-CoV-2 genomic 519 sequence data on which this research is based. A complete list acknowledging the authors 520 who submitted data analyzed in this study can be found in Data S1. MJS thanks the support. JOW was supported by an NIH-NIAID R01 AI135992. distinct edges of T (mostly we will deal with the case c = 2). The probability that a 18 character evolves on T with 2 edge changes on e 1 , e 2 is p 2 (1 − p) 2n−5 , while the probability 2 that a character evolves on T with 2 state changes is 2n−3 2 p 2 (1 − p) 2n−5 .

We pause here to make an observation: If a data set consists of characters each of 21 which have evolved on T with either 1 or 2 edge changes, then the maximum parsimony 22 tree(s) and the maximum compatibility tree(s) for this data set will be exactly the same.

Moreover, if there are sufficiently many constant characters, then any maximum likelihood 24 tree will also be one of these trees.

Recall that a split refers to a bipartition of the leaf set [n] into two non-empty 26 subsets (and splits are induced by binary characters). A character that has evolved 27 perfectly on T produces a split, and these splits (across a set of perfectly evolved 28 characters) are compatible and so form a (generally unresolved/non-binary) tree on leaf set 29

[n].

Now suppose that m characters evolve on T and suppose that, of these m 31 characters, k of them are perfectly evolved on T (note that more than one of these 32 characters may correspond to the same split of T ).

Next, consider a single additional character f which has evolved on T with two edge 34 changes on e 1 , e 2 (there is no restriction that these are interior edges). The probability that 35 this character f is a binary character is 1 3 (as noted earlier). Moreover, in that case, the 36 bipartition induced by f corresponds to a split of T if and only if e 1 and e 2 are adjacent.

The question now arises as to whether we can discover that f gives a false split of T 38 (without knowing T ), just on the basis of the other k perfectly-evolved characters.

We next introduce some further notation. We will let ξ denote the expected number of state changes in the tree T per character, under the model described. Thus ξ equals the per-edge rate λ times the number of edges of T , and so ξ = λ · (2n − 3).

We will also use the following standard notation throughout: we write f (n) ∼ g(n) 40 if the ratio f (n)/g(n) tends to 1 as n becomes large. Note that the sequence ϕ T is a topological invariant (i.e. it depends only on the 51 unlabelled shape of the tree) and does not depend on any other parameters mentioned 52 above. Clearly, for any T ∈ B(n) we have: (2) Thus, Eqn.

(1) translates to the identity n−3 i=1φ T (i) = 1. Thusφ T (i) is the probability 56 that a pair on edges (selected uniformly at random from all pairs) has exactly i edges lying 57 on the path strictly between the two selected edges.

Next we state a simple combinatorial result that will be useful in the proof of the 59 first theorem.

Lemma 1 Let f be a binary character that has evolved on T by 2 edge changes on e 1 and 61 e 2 , and let f be a character that has perfectly evolved on T by a change on a single does not lie on the path in T that is strictly between e 1 and e 2 (i.e. the red edges in Fig. 1) .

these two partial splits are incompatible, so too are f and f .

Next suppose that e lies in one of the green subtrees of Fig. 1 (C-iii) the split described by f is compatible with k characters that are perfectly evolved 89 on T (by the Markovian process described above).

The point of these conditions is that we might add the split corresponding to f to 91 the other k compatible spits which still be compatible, but create a 'false positive' split in 92 the resulting tree (note that we do not know a-priori which splits are the perfectly evolved 93 ones, as we are assuming that T is not known).

We now present an exact expression for Φ 

Thus R T (e 1 , e 2 ) is the proportion of interior edges of T that lie between two 98 non-adjacent edges e 1 and e 2 . Next, let µ T denote the average value of d T (e, e ) over all pairs of edges {e, e } 100 sampled from T . That is:

. 

T is the expected value of

(1 − R T (e 1 , e 2 )) k when a pair of edges of T is selected uniformly at random from the set of 126 all such pairs. Since f (x) = (1 − x) k is a convex function, Jensen's inequality then gives:

(1 − 0) k , since there are 3(n − 2) pairs of adjacent edges in T , and 1 3 · 3(n−2) ( 2n−3 2 ) = 1 (2n−3) , and so applying this to the Eqn. 5 gives the result claimed. T for a given tree T with n leaves involves a 136 summation of just O(n 2 ) terms, so could be calculated fairly easily.

Notice also that in the special case where k = 1, Theorem 2(i) simplifies (via 138 Eqn.

(1)) as follows.

Corollary 1

Exact values of Φ (k)

T for classes of trees 140 Proposition 1 8 (i) Let T n denote a caterpillar tree with n leaves. Then Φ T 4 = 0 and for n 5:

Thus, for fixed k we have:

where o(1) is a term that tends to 0 as n grows.

(ii) Let T h denote any tree in B(2 h + 1) obtained by attaching a leaf to the root of a 143 complete balanced binary tree of height h. Then

where:

(iii) Let D be a probability distribution on B(n). Then

In particular, when D is the PDA distribution on B(n) 

For h 1 and i 1, let N h i denote collection of pairs of edges in T h that are separated by exactly i edges, and let n

Notice also that deleting the stem edge of T h+1 (and its incident vertices) produces two 154 copies of T h -we will call these L and R ('left' and 'right').

The set N h+1 i can be partitioned into three classes:

Class 1: A pair of edges consisting of an edge of T h+1 at distance i from the stem 157 edge, together with the stem edge. This set has size u

is as given 158 above.

Class 2: A pair of edges that lie entirely within L or entirely within R. This set has 160 size 2 · n (h)

i .

Class 3: A pair of edges {e 1 , e 2 } with e 1 in L and e 2 in R with the distance between 162 e 1 and e 2 being i.

A little care is required to enumerate Class 3. One subcase is that e 1 is not the stem edge of L and e 2 is not the stem edge of R. In that case, the number of choices is

which equals the expression Q(i, h) given in Part (ii). The complementary subcase where e 1 or e 2 is the stem edge of L or R contributes

Combining the above gives the recursion:

which simplifies to:

or equivalently,

Solving this recursion (noting that u 

The expression for E[φ T (i)] involves an argument in enumerative combinatorics. Let N (n, k) denote the number of (unordered) forests consisting of k rooted binary trees whose leaf sets are disjoint and contain a total of n leaves (we allow a single labeled leaf to be a rooted tree of size 1, otherwise the root of each tree has degree 2). It is easily seen that N (n, 2) is precisely the number of rooted binary trees (2n − 3)!!, since deleting the root of such a tree, produces a forest of two trees with disjoint leaf sets. It turns out there is an exact formula for N (n, r) (from ∼1990):

N (n, r) = (2n − r − 1)! (n − r)!(r − 1)!2 n−r , for r = 1, . . . , n (see e.g. (3), p. 105).

An ordered forest is a forest with a linear ordering on its component trees. If O(n, r) denotes the number of ordered forests consisting of r trees then O(n, r) = r!N (n, r) = r (2n − r − 1)! (n − r)!2 n−r .

Notice that:

To see this observe that twice the right-hand side counts the number of pairs (T, (e 1 , e 2 )) 169

where T ∈ B(n) and (e 1 , e 2 ) is an ordered pair of non-adjacent edges. Notice that if we 170 delete the path connecting e 1 and e 2 we obtain an ordered forest of rooted trees on the 171 same leaf set as T (cf. Fig. 1 ) . This provides a bijection between these two sets in which 172 the number of strictly edges between e 1 and e 2 in T is equal to i − 3 where i is the number 173 of forests.

Thus,

and rearranging, gives the result.

To illustrate Part (i), with n = 5 we have Φ T 5 = 8 63 ( 1 2 ) k . Notice that this converges 177 to zero exponentially fast with k (and in general for fixed n this will be the case).

Also, observe that for general values of n and with k = 1 (say) we have:

Thus, for the simulation involving trees with n = 500 leaves, if these were caterpillars 179 (instead of YH trees), for k = 1 we would expect Φ

T to be very close to 2 9 (in agreement 180 with the asymptotic claim in the Part (i) of Proposition 1) which is lower than the 181 simulated values on YH trees, as expected.

To illustrate Part (ii), ϕ T 3 (2) = P (2, 3) + Q(2, 3) = 2 4 (2 1 − 1) + 2 2 (2 + 1) = 28.

Notice also that for general h we obtain (as expected)

and ϕ T h (i) = 0 for i > 2h − 2.

Notice that for h large, P (i, h) is negligible relative to Q(i, h).

A lower bound on the expected value of Φ for all n 4 we have:

Moreover,

where o(1) is a term that tends to 0 as n grows.

(iii)

where o(1) is a term that tends to 0 as n grows, and c is a constant (independent of 189 n). 

The first step is to apply a classic technique in enumerative combinatorics; namely, 193 we count a certain set Ω in two different ways, to obtain an equation. Here, we take the set corresponds to an interior edge of T , and {e 1 , e 2 } ∈ N A(T ) with the edge of T 196 corresponding to split σ being strictly within the path connecting e 1 and e 2 .

Summing first over all choices of T we have:

and so,

where |C T | = 2n−3 2 − 3(n − 2) is the number of pairs of non-adjacent edges in T .

On the other hand, we can count |Ω| by first summing over all choices of the split σ 201 (stratified by the size k of the smaller half of the split) and then counting for each such σ 202 the number of T and {e 1 , e 2 }. This gives:

where rb(m) = (2m − 3)!! is the number of rooted trees on a leaf set of size m.

To see this, note that n k is the number of ways to partition [n] into a split σ of two 205 sets of size k and n − k, and 1 2 rb(k)rb(n − k) is the number of trees in B(n) that contain 206 this split (the factor 1 2 is to remedy the double counting that occurs). Finally, given T and 207 σ there are now (2k − 2)(2(n − k) − 2) choices of {e 1 , e 2 } where e 1 is an edge of the rooted 208 subtree of T on the leaf set of size k and e 2 is an edge of the rooted subtree of T of size 209 n − k.

We will apply generating function techniques to calculate an exact expression for term on the right hand side of Eqn. (13). Notice can rewrite Eqn. (13) as follows:

where [x n ]f (x) denote the coefficient of x n in f (x), and hence, even more compactly,

Let rb(n) = (2n − 3)!! (the number of rooted binary trees on a leaf set of size n) and 213 let ν(x) = n 1 rb(n)

n! x n denote the associated exponential generating function. It is well 214 known (see e.g. (3)) that

which has the unique solution

We will use the following relationships which follow easily from Eqn. (16):

where ν (x) = d dx ν(x).

Recall, F (x) from Eqn. (15). We have:

Thus,

and this simplifies by Eqns. (18) and (19)) to:

In order to determine the expression in Eqn. 

For the third term of Eqn. (20), we have:

Substituting the expressions on the right hand side of Eqns. (18), (19) and (20) 

The second claim in Part (i) of Theorem 2 follows from the asymptotic equivalence: rb(n) n! ∼ 1 2 √ π 2 n n −3/2 together with some standard algebraic manipulation.

Part (ii): We apply Theorem 2(Part (ii)) to Part (i) of the current theorem. WIth k γ √ n we obtain:

Part (iii): We apply Proposition 4 of (1) (where β = 0 corresponds to the YH 233 distribution). This result shows that in a tree T ∈ B(n) sampled according to the YH 234 distribution the maximum distance D from any leaf to the root is less than 235 (4.31 + ) log(n) with a probability converging to 1 as n grows. Now µ T is always less than 236 twice the distance from any leaf in T to any other leaf in T , and this inter-leaf distances is 237 bounded by 2D. The result now follows from Theorem 2(Part (ii)).

Remarks:

Note that Parts (ii) and (iii) imply that for each fixed k,

both converge to 1 3 as n grows, in contrast to the caterpillar case of Part (i) of 241 Proposition 1, where the limit is smaller (e.g. for k = 1, the limit is 2 9 ).

Notice also that the lower bounds on E D [Φ 

Conditional on K = k, the value p T is simply given (ignoring terms involving λ of 260 higher than quadratic power) by:

In particular,

We have

Provided that λm << 1 we may approximate this last equation by:

Thus, provided that m large and λm < 1 and using using µ K to estimate K (which is reasonable since m is large, and we are also assuming above that λn << 1) we obtain:

and solving for λ in this equation, and substituting the estimate of λ into Eqn. (26) gives:

T .

Let us now multiply through by m (the number of characters) to get.

The term on the left (mp T ) has a natural interpretation -it is simply the expected 264 number of characters that have evolved on T by 2 edge changes and that also satisfy the 265 three conditions (C-i)-(C-iii) above. By Equation (27), this can be approximated by

T . Note that this quantity is independent of λ (provided this is sufficiently small 267 that the above approximations are reasonable), and n also is not involved in the first term 268 in the product.

Note that one cannot interpret mp T as the expected number of false splits in a 270 maximum compatibility tree involving the K perfectly evolved characters and the additional characters with 2 changes. This is because some of these latter characters may (c 1 ) P ⊆ P or P ⊆ P .

(c 2 ) e does not lie in P and e does not lie in P .

Proof: The proof involves a case analysis. In particular, we show that:

(1) if (c 1 ) holds then f and f are compatible;

(2) if (c 1 ) fails but (c 2 ) holds, then f and f are compatible; and

(3) if (c 1 ) and (c 2 ) both fail to hold, then f and f are incompatible.

To simplify notation we first introduce some conventions. We will let σ (respectively 293 σ ) denote the split of [n] induced by (reversing state) changes on e 1 and e 2 (respectively on e 1 and e 2 ). For subsets V 1 , . . . , V r of [n] we write V 1 · · · V r |− as shorthand for the split 295 (V 1 ∪ · · · ∪ V r )|([n] − (V 1 ∪ · · · V r )), and if a subtree within T has leaf set A ⊂ [n] we will 296 denote this subtree by writing t(A). We will also assume that e 1 , e 2 , e 1 , e 2 are four distinct 297 edges (we deal with the case where this fails at the end). Case (1): Suppose that (c 1 ) holds. Without loss of generality we may assume that 299 P ⊆ P and that the order of the path from e 1 to e 2 passes through e 1 and e 2 in this order.

Thus we can represent T as in Fig. 2(i) where A, B, C, D, E denote the (unions of the) leaf 301 sets of the corresponding subtrees determined by this arrangement of the four edges.

Thus σ is the split ABDE|C, and σ is the split AE|BCD. Since the second half of 303 he first split (namely C) is a subset of the second half of the second split (BCD) these two 304 splits are compatible. Note that this argument also holds if one or both of the following 305 identities applies: e 1 = e 1 or e 2 = e 2 .

Case (2): Now suppose that (c 1 ) fails but (c 2 ) holds. In this case, by considering the 307 path between e 1 and e 2 we can represent T as in Fig. 2 Finally, we assumed that e 1 , e 2 , e 1 , e 2 are four distinct edges. Otherwise (since 326 e 1 = e 2 and e 1 = e 2 ) we either have: {e 1 , e 2 } = {e 1 , e 2 }, in which case condition (c 1 ) holds, 327 and the two characters f and f induce the same split and so are compatible, or we may 328 assume without loss of generality that e 1 = e 1 and e 2 = e 2 . A similar (though simpler) case 329 analysis to the above leads to the same conclusions as before. In general, it is mathematically complicated to describe FP T in terms of these 352 parameters. However, when the number of leaves in a tree grows faster than the number of 353 perfectly compatible characters, it is possible to state a limit result, in order to provide an 354 approximation to FP T for large trees.

In the following theorem we consider the following setting: 356 I. mξ = Θ(n β ) for some 0 < β < 1 2 , and 357 II. mξ 2 = O(1), 358 where O(1) refers to dependence on n (thus mξ 2 is not growing with n). Note that 359 Condition (I) implies that the number of perfectly-evolved characters grows with the 360 number of leaves, but at a rate that is slower than linearly. When β 1 2 , Condition (I) also provides a positive probability that more than one perfectly-evolved character will give rise 362 to the same non-trivial split. Conditions (I) and (II) imply that ξ decreases as n increases. 363 We will show that in this setting, the false positive rate is (asymptotically) of the form ξ 3 times a function Ω that involves T (via its shape), m and ξ. To describe this result, we need to define this function Ω. Let

where µ = 1 2 mξ and whereφ Tn (i) is given in Eqn.

(2). For example, for any caterpillar tree, we have 364φ

).

Notice that Ω(T n , ξ, m) depends on T n only via the coefficientsφ Tn (i), and this 366 dependence in linear. Thus, if D is a distribution on trees (e.g. the PDA or YH) then the 367 expected value of Ω(T n , ξ, m) is given by:

For the PDA distribution, the term E P DA [φ T (i)] has an explicit exact value, given 369 by Eqn. (9).

Theorem 3 For each n 1, let T n be a binary phylogenetic tree with n leaves, and 371 suppose that Conditions (I) and (II) hold.

(i)

where o(1) is a term that tends to 0 as n grows. and so has expected value 1 2 mξ(1 + o(1))). As before, let K denote the number of distinct 378 non-trivial splits of T n that correspond to (one or more) of these perfectly evolved 379 characters. By Condition (I) it follows that each perfectly evolved character is (with 380 probability → 1) generated at most once, and so K is approximated by X 1 , which in turn 381 is approximated (for large n) by a Poisson distribution with mean µ = 1 2 mξ.

Consider now the splits that arise from 2-edge-change characters on T . Denote the 383 number of false splits by X 2 , and the number of true splits by X 2 . The expected value of 384 X 2 is of order mξ 2 /n and so it converges to zero as n grows, by Condition (II). By

Proposition 3 and Condition (II), the probability that every pair of the X 2 false splits is 386 compatible (with each other) tends to 1 as n grows.

Next, consider splits (true or false) that arise from characters involving 3 or more 388 changes on different edges. If X 3 denotes the number of such characters, then the expected 389 value of X 3 is bounded above by a constant (independent of n) times mξ 3 , and by 390 Conditions (I) and (II), the ratio of this to K is of order ξ 2 (independent of n).

Thus,

where K has a Poisson distribution with expected value µ = mξ 2 , and O(1) refers to 393 a term that is bounded in n (by Condition (II)) and accounts for any non-trivial splits 394 induced by characters that are not perfectly evolved and in the reconstructed tree (as well 395 as for splits from perfectly evolved characters that are 'lost' in a strict consensus by being 396 the only such split in a path between the two edges of a 2-change character), while o(1) 397 refers to a term the tends to zero as n grows due to characters that involve 3 or more edge changes).

Thus, P(K = k) = e −µ µ k /k!, and so, under Conditions (I) and (II) we have:

where ρ := 1 − i n−3 . We now apply the identity: ∞ k=0

x k k!(k+1) = e x −1 x , with x = ρµ to obtain:

Since µ = mξ/2, notice that we can write the term ξ 2 m 2 in Eqn. (29) as ξµ and so, from 401 Eqns. (29) and (30), we have:

Finally, canceling the term µ in the numerator and denominator on the right-hand-side of 403 Eqn. (31) gives the expression in Part (i).

Part (ii): This follows directly from Part (i) by linearity of expectation.

Part (iii): When T n is a caterpillar tree we have:

for 1 i n − 3. We can rewrite this as:φ Tn (i) = 4 2n−3 · 1 − i n−2 , and substituting this into Part (i) we obtain: (1)).

As n → ∞ we have the asymptotic identity:

where x := e −µ/(n−3) .

Now, x converges to 1 from below as n grows (under Conditions (I) and (II)), and 408 so applying the identity ∞ i=1 x i = x/(1 − x) (for 0 < x < 1), together with the identity 1 − x ∼ µ/(n − 3) ∼ µ/n = mξ/2n (since µ = 1 2 mξ) gives:

noting that the term (2ξ/3)e −µ is asymptotically negligible relative to the term it follows where ps(T, D) is the parsimony score of the tree T for the data. We will assume that 426 ps(T, D) << m.

Now, we can write the likelihood function for T as follows:

428 L(T |D) = Q m−n 0 0 · Q m 1 · q n 1 1 q 2n 2 2 · · · q kn k k ,

where Q 0 = ν 2n−3−ps(T,D) , Q 1 = k i=1 (1 − p i ) in i , q i = p i 1−p i . Clearly, L(T |D) is maximised by setting ν = 1 (regardless of the other parameters). Moreover, the log likelihood critical 

T is a natural quantity to compare with the earlier Φ 

The ability of single genes vs full genomes to resolve time REFERENCES

Public REFERENCES

Probability distributions on cladograms

Springer: IMA Volumes in 463 Mathematics and its Applications 76

Maximum likelihood inference of phylogenetic trees, with special 465 reference to a Poisson process model of DNA substitution and to parsimony analysis