key: cord-0288144-oviyk2j2
authors: Ye, Yongtao; Shum, Marcus H.; Tsui, Joseph L.; Yu, Guangchuang; Smith, David K.; Zhu, Huachen; Wu, Joseph T.; Guan, Yi; Lam, Tommy T.
title: Robust expansion of phylogeny for fast-growing genome sequence data
date: 2022-01-03
journal: bioRxiv
DOI: 10.1101/2021.12.30.474610
sha: e960d7625b0f3a46928c72bbb72bf7a5f6663816
doc_id: 288144
cord_uid: oviyk2j2

Massive sequencing of SARS-CoV-2 genomes has led to a great demand for adding new samples to a reference phylogeny instead of building the tree from scratch. To address such challenge, we proposed an algorithm ‘TIPars’ by integrating parsimony analysis with pre-computed ancestral sequences. Compared to four state-of-the-art methods on four benchmark datasets (SARS-CoV-2, Influenza virus, Newcastle disease virus and 16S rRNA genes), TIPars achieved the best performance in most tests. It took only 21 seconds to insert 100 SARS-CoV-2 genomes to a 100k-taxa reference tree using near 1.4 gigabytes of memory. Its efficient and accurate phylogenetic placements and incrementation for phylogenies with highly similar and divergent sequences suggest that it will be useful in a wide range of studies including pathogen molecular epidemiology, microbiome diversity and systematics.

research groups, sequencing quality varies and ambiguity bases often occur in the consensus 139 genome sequence data, which could affect the placement accuracy. To account for ambiguity data 140 in sequencing, we used a specific substitution scoring table based on the IUPAC nucleotide 141 ambiguity codes (table S3) for the taxon placement and insertion process (details in 142 Methodology), which achieved a robust performance in sequences of different qualities. 143 144 Notably, when searching through the whole phylogeny for the best position to place a taxon, there 145 may be cases where multiple branches achieve equal minimum substitution scores, and thus the 146 placement will be uncertain. As demonstrated in Fig. 1C , TIPars produced the least number of 147 multiple ambiguously optimal placements in all testing datasets. For example, TIPars generated 148 23% fewer multiple placements than UShER in the SARS2-100k dataset. 149 150 A possible reason for the relatively poor performance of EPA-ng could be that RF distance may 151 not be a reliable metric to compare binary trees derived from the phylogeny with polytomy 152 because there is a very skewed distribution of RF distance when comparing two random binary 153 trees (Bryant & Steel, 2009 Cumulative proportion of single taxon placement result with different RF distance cutoff was 166 shown in Fig. 1D . TIPars produced trees with significantly higher topological similarity to the 167 reference tree with a median RF distance of 0.5 and mean of 5.8 (99% confidence interval (CI) = 168

[5.5-6.1]) as compared to UShER (median RF distance is 3.0 and mean is 31.2 (99% CI = [30.0-169 32.4])) at 99% significance level (p-value < 10 -10 ). In the SARS2-100k dataset, we performed multiple taxa insertion for 100 sets of 10 2 and 10 3 179 randomly selected sequences (an example is shown in Fig. 2A ) (random100 and random1000) 180 and 100 sets of 10 2 and 10 3 successively selected sequences (i.e., a set of successive taxa 181 following the tree postorder traversal; an example is shown in Fig. 2B ) (successive100 and 182 successive1000). In the 16S, H3N2 and NDV datasets, 100 sets of 50 sequences were randomly 183 selected. The selected sequences are pruned from the corresponding reference tree and become 184 multiple taxa to be reinserted for each testing set. 185 186 RF distance and tree log-likelihood (LL) were used to evaluate the performance of the multiple 187 sequence insertion. To evaluate the topology accuracy, the resulting tree produced by the four 188 programs were compared to the original reference tree (leaf taxa unpruned) to obtain the RF 189 distance. At the same time, Gamma20 log-likelihoods of the reference tree and the resulting tree 190 after optimizing the branch length were also computed using FastTree2 (double-precision version) 191 and their differences were used for evaluation. 192

For the random100 and random1000 datasets, only analyses using TIPars and UShER were able 194 to complete within a reasonable computation time, hence no result from IQ-TREE2 and PAGAN2 195 was present. The resulting trees from multiple taxa insertion using TIPars had a significantly 196 smaller RF distance than those generated using UShER (Fig. 3A ). In addition, the log-likelihood 197 of the resulting trees from TIPars was significantly higher than that of UShER (Fig. 3B) . UShER with a significantly lower RF distance and a higher log-likelihood of resulting trees ( Fig.  205 3, E to H; fig. S3 ). In the H3N2 dataset, there was no significant tree log-likelihood difference 206 between TIPars and UShER (Fig. 3G) , and in NDV dataset, TIPars performed better than IQ-207 TREE2 with higher mean log-likelihood but without statistical significance (Fig. 3H) . The 208 demonstrations of the taxa-insertion result were visualized in Fig. 2C where UShER, IQ-TREE2 209 and PAGAN2 were less accurate than TIPars. 210

For the successive100 and successive1000 datasets, TIPars resulting trees had a significantly 212 larger RF distance than those of UShER (Fig. 3C) . However, the log-likelihood of the TIPars 213 resulting trees was significantly higher than that of UShER ( Fig. 3D; fig. S2 , C and D). By 214 comparing the trees generated from TIPars and UShER (Fig. 2B) , the difference is that TIPars 215 inserted some of query taxa (green lines in Fig. 2B ; successive taxa pruned from the reference 216 tree) into two subtrees where one of them (the one containing over half the queries) had the same 217 topology as the one in the reference tree. Whereas UShER inserted those queries mostly within a 218 monophyletic clade but it was different from the reference tree. As a result, UShER retained the 219 local topology (better RF distance) (Lin et al., 2012; Smith, 2021) but missed the global topology 220 (worse log-likelihood). Through a RF distance comparison specifically to each query taxon 221 instead of all query taxa, we found that the RF distance resulted from UShER was not 222 significantly higher than that of TIPars (table S4) . 223

On the other hand, we may suppose that in the situation of random100 and random1000 tests, RF 225 distance would be a suitable metric for comparing the performance of taxa insertions as they are 226 similar to the case of single taxon placements, where most removed taxa are within different 227 monophyletic clades due to randomness (Bryant & Steel, 2009) . 228

To make the log-likelihood of the resulting trees comparable, we applied FastTree2 to reoptimize 230 the branch lengths with fixed topology (Price et al., 2010) . However, compared to the efficiency 231 of taxa insertion (Table 1) , the re-optimization is time-consuming. For example, the optimization 232 for a SARS2-100k tree took 10 to 12 hours and required around 125 GB memory (table S5) . 233 Therefore, we also computed the log-likelihoods with fixed branch lengths (FLL) using IQ-234 TREE2, and TIPars still outperformed UShER significantly ( fig. S4 ) by achieving a higher log-235 likelihood in the resulting tree output directly from the program. 236 237

To verify practicability of TIPars in adding novel sequences into a given phylogeny, we further 239 performed an experiment to insert novel real-world SARS-CoV-2 samples into the SARS2-100k 240 reference tree. We randomly selected SARS-CoV-2 samples from GISAID which were not 241 included in the SARS2-100k dataset. Twenty sets of 100, 1000, 5000 and 10000 genome 242 sequences were generated as the queries for taxa insertion using TIPars and UShER. 

TIPars showed promising taxa placement and insertion accuracy in the phylogenies with 264 homogenous (H3N2 and SARS2-100k) and divergent (16S and NDV) sequences, and in 265 extremely large phylogeny (SARS2-660k) with reasonable runtime and memory usage. Although 266

UShER has a lower accuracy in the divergent sequence datasets (16S and NDV), it ran faster than 267

TIPars (Table 1) Although we showed that TIPars resulting trees with higher tree log-likelihood compared to other 284 programs, a general limitation of the phylogenetic placement method is that errors from incorrect 285 placements accumulate as multiple sequences are inserted sequentially. In order to minimize the 286 error due to large numbers of sequence insertions, it is suggested to conduct tree refinements on 287 not only branch length but also tree topology using different techniques such as nearest-neighbor 288 interchanges (NNIs) and subtree-pruning-regrafting (SPRs) (Price et al., 2010) . Furthermore, 289 starting such optimization process with an initial tree of higher log-likelihood may achieve a final 290 tree with better log-likelihood using certain of time (Price et al., 2010) . As demonstrated in table  291 S7, for the resulting trees of equal RF distance from both TIPars and UShER (n=28), the branch 292 length optimized trees for TIPars had higher (n=14) or equal (n=12) tree log-likelihoods than the 293 ones resulted from UShER. 

After assigning the ancestral sequences at every internal node and taxa sequences at external 314 nodes, TIPars inserts a set of new samples into the reference phylogenetic tree sequentially based 315 on parsimony criteria. 316 317 For a query sequence Q, TIPars computes the minimal substitution score against every branch in 318 the tree. While inserting query Q into to the branch A-B (parent node -child node) at a potential 319 newly added node P (Fig. 1A) , the minimal substitution score is the sum of substitution scores 320 that sequence Q differs from both sequence A and sequence B based on a specific substitution 321 scoring table based on the IUPAC nucleotide ambiguity codes (table S3). The single branch with 322 the minimum substitution score  is reported as the best placement. 323 324 However, in terms of multiple placements where more than one branch have the same minimum 325 substitution score, TIPars applies simple but practical rules to filter them to a single best 326 placement such that multiple queries would be inserted sequentially based on one resulting tree. 327

The first priority is to select the branch with node A containing the most numbers of child nodes. 328

The second priority is to select the branch with node A of the lowest node height, that is the total 329 branch length on the longest path from the node to a leaf (Suchard et al., 2018). Finally, in the 330 case where the ambiguity cannot be resolved by the first two priorities, TIPars just turns to a pick 331 up randomly. Even though TIPars will filter out multiple placements, these potential placements 332 will also be printed out for user notice. CoV-2 viral genome sequences collected before 1st January 2021 were extracted from the MSA. 356

In order to ensure the sequences used for downstream analysis were complete, SARS-CoV-2 357 genomes with sequence length < 29,000 bp and > 0.5% Ns were removed (namely 358 GISAID202101). To ensure that the global phylogenetic diversity is well represented in the sub-359 sampled dataset, sequences from all lineages as designated by the PANGO nomenclature system 360 (Rambaut et al., 2020) were sub-sampled. Where fewer than 50 sequences of a given lineage were 361 found in the global dataset, all sequences of the lineage were included. This resulted in a final 362 sub-sampled dataset of 96,020 sequences from 1,249 PANGO lineages, with hCoV-19/Wuhan/WIV04/2019/EPI_ISL_402124 included as the reference genome (namely SARS2-364 100k). The SARS2-100k reference tree was then built using IQ-TREE2 with GTR model using 365 the EPI_ISL_402124 as root. Ancestral sequences of each internal node were estimated using 366

PastML with the MSA and the IQ-TREE2 generated tree as input. were not able to complete the computation within a reasonable runtime (Table 1) 

For single taxon placement evaluation, we first pruned one taxon from the reference tree and re-415 inserted it back. To assess the consistency between placement algorithms and the typical tree-416 constructing approach, we proposed using Robinson-Foulds (RF) Distance as a measure of the 417 tree topology accuracy, as calculated by TreeCmp (Bogdanowicz, Giaro, & Wróbel, 2012) . When 418 the RF distance between a hypothetical tree and the reference tree is zero, the topology of the 419 hypothetical tree is the same as the reference tree which means the algorithm inserts the query 420 sample into the reference tree topological correctly. Another performance comparison with 421 different true positive definition was conducted for binary trees derived from trees with polytomy 422 using the measurement of whether sister node sets are identical to reference (Y. Turakhia et al., 423 2020) . 424 425 For multiple taxa insertion evaluation, we randomly pruned a set of taxa from the reference tree 426 and re-inserted them back. In addition to using RF distance to compare the hypothetical tree 427 against the reference tree, we also calculated the log-likelihood of the hypothetical tree as a 428 measurement of the accuracy of the taxa insertions. We applied two methods to compute log-429 likelihoods including FastTree2 (double-precision version) (Gamma20 Log-Likelihood) ( 

EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences

TreeCmp: Comparison of Trees in Polynomial 482 Time

Computing the Distribution of a Tree Metric

MUSCLE: a multiple sequence alignment method with reduced time and 486 space complexity

Phylogenetic 488 placement of metagenomic reads using the minimum evolution principle

A Fast Likelihood Method to 491 Reconstruct and Visualize Ancestral Scenarios

MAFFT Multiple Sequence Alignment Software Version 7: 494 Improvements in Performance and Usability

Hypothesis Testing Using Phylogenies

A Metric for Phylogenetic Trees Based on 501

An algorithm for progressive multiple alignment of 504 sequences with insertions

Accurate extension of multiple sequence 507 alignments using a phylogeny-aware graph algorithm

pplacer: linear time maximum-510 likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

A Daily-Updated Database and Tools for Comprehensive SARS-514

CoV-2 Mutation-Annotated Trees

IQ-TREE 2: New Models and Efficient Methods for Phylogenetic 518 Inference in the Genomic Era

2019//). The Cluster Affinity Distance for Phylogenies

Genomes OnLine database (GOLD) v.7: updates and new features

ape 5.0: an environment for modern phylogenetics and 526 evolutionary analyses in R

Visualizations with statistical details: The 'ggstatsplot' approach

FastTree 2--approximately maximum-likelihood 531 trees for large alignments

A fast algorithm for joint reconstruction of 533 ancestral amino acid sequences

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic 537 epidemiology

Comparison of phylogenetic trees

Reversible 541 polymorphism-aware phylogenetic models and their application to tree inference

GISAID: Global initiative on sharing all influenza data -from 544 vision to reality. Euro surveillance : bulletin Europeen sur les maladies transmissibles = 545

Illustration of the placement for a query sequence. "Q" indicates the query sequence, "A" and "B" 581 represent the existing nodes in the reference tree. "P" represents the parental node of "Q" 582 generated by TIPars. Minimum substitution score is calculated based on the triplet formed by A-