title: Phylogeny of Viruses
authors: Gorbalenya, Alexander E.; Lauber, C.
date: 2017-06-26
journal: Reference Module in Biomedical Sciences
DOI: 10.1016/b978-0-12-801238-3.95723-4

Biological species, including viruses, change through generations and over time in the process known as evolution. Viruses may evolve at high, uneven, and fluctuating rates among genome sites. The accumulated changes, through either mutation or recombination with other species, are first fixed in the genome of successful individuals that give rise to genetic lineages. The relationship between biological lineages related by common descent is called ‘phylogeny’. For inferring phylogeny, the differences between aligned sequences of genomes and proteins are quantified and depicted in the form of a tree, in which contemporary species and their intermediate and common ancestors occupy, respectively, the terminal nodes, internal nodes, and the root. The tree is characterized by a topology, length of branches, shape, and the root position. A complex mathematical apparatus has been developed for phylogeny inference that can evaluate inter-species differences, facilitate tree building and comparison of trees, and assess the fit between data and tree through, typically, computationally intensive calculations. A reconstructed tree is an approximation of the true phylogeny that practically remains unknown. The phylogenetic analysis is used in applied and fundamental virus research, including epidemiology, diagnostics, forensic studies, phylogeography, evolutionary studies, and virus taxonomy. It can provide an evolutionary perspective on variation of any trait that can be measured for a group of viruses.

Our knowledge about contemporary virus diversity has been steadily advancing with new viruses being constantly described by systematic efforts as well as occasional discoveries.
These developments indicate that only a small part of virus diversity has so far been unraveled and has become available for phylogenetic studies. It is also likely that many more lineages existed in the past; some of these lineages are likely to have ancestral relationships with contemporary lineages. Species share similarity that varies depending on the rate of evolution and time of divergence. The entire process of generating contemporary species diversity from a common ancestor is believed to proceed through a chain of intermediate ancestors specific for different subsets of the analyzed species (Fig. 2). Typically, these ancestral sequences are estimated internally during the tree building process or are not required at all, depending on the method used. The relationship between the common ancestor, intermediate ancestors, and contemporary species may be likened to the relationship between, respectively, root, internal nodes, and terminal nodes (leaves) of a tree, an abstraction that is widely used for the visualization of this relationship (Fig. 2). Aligning the contemporary sequences side by side with the reconstructed tree, as shown for the toy example in Fig. 2, may reveal the full chain of sequence changes that happened during evolution; this, however, is rarely the case for real data sets because of repeated substitutions and incomplete species sampling. Trees are also objects of graph theory, a branch of mathematics whose apparatus is used in phylogeny. Formally, and because of the strong link between phylogeny and taxonomy, leaves may be called operational taxonomic units (OTUs), while internal nodes and roots, which have not been directly observed, are known as hypothetical taxonomic units (HTUs). Nodes are connected by branches or edges. The tree may be characterized by topology, length of branches, shape, and the position of the root (Fig. 3).
The topology is determined by relative positions of internal and terminal nodes; it defines branching events leading to contemporary species diversity. If two or more trees obtained for different data sets feature a common topology, these trees are called congruent. The branch lengths of a tree may represent either the amount of change fixed or the time passed between two connected nodes; the tree is then known as 'additive' or 'ultrametric', respectively (Fig. 3B and C). The tree shape may be linked to particulars of the evolutionary process and reflect changes in population size and diversity due to genetic drift and natural selection. The position of the root in the tree defines the direction of evolution. Species that descend from an internal node in a rooted tree form a lineage (cluster), and the node is called the most recent common ancestor (MRCA) of the lineage, which thus has a monophyletic origin (Fig. 2). The branch lengths and the root position may be left undefined for a tree, which is then called a 'cladogram' and an 'unrooted tree', respectively (Fig. 3A).
Multiple alignments of polynucleotide or amino acid sequences representing analyzed species and maximized for similarity are traditionally used as input for phylogenetic analysis. The quality of alignment is among the most significant factors affecting the quality of phylogenetic inference. Due to the redundancy of the genetic code, changes in polynucleotide sequences are accumulated at a higher rate than those in amino acid sequences. In viruses, including RNA viruses, this difference is not counterbalanced by other local or global constraints on variation of genomes that are linked to, e.g., di-nucleotide frequency or RNA secondary (tertiary) structure.
Because of these differences, polynucleotide sequences are commonly used for phylogeny reconstruction of only those species that are closely related, while protein sequences, which better preserve phylogenetic signal, may be used to infer phylogeny of distantly related species. Differences between species, as calculated from an alignment, may be quantified either as pairwise distances forming a distance matrix or as position-specific substitution columns (discrete characters of states of alignment), the latter preserving the knowledge about the location of differences. The respective methods dealing with these quantitative characteristics are known as distance and discrete (character state) methods. The distance methods are valued for their speed and are considered a technique of choice for analysis of very large data sets, although character state methods have caught up in this respect owing to recent algorithmic advancements (see also below). Distance methods are often designed to converge on a unique phylogeny by clustering, without considering any alternatives. The unweighted pair group method with arithmetic means (UPGMA), the first technique introduced for clustering, uses a constantly recalculated distance matrix to define the hierarchy of similarities through systematic, stepwise merging of the most similar pair at a time. The neighbor-joining (NJ) method uses a more sophisticated clustering algorithm that minimizes branch lengths and is the most popular among distance methods. Although different trees may be compared in how they fit a distance matrix, it is character-based methods that are routinely used to assess numerous alternative phylogenies in search for the best one in a computationally intensive process.
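The distance-based route described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy alignment, the function names, the nested-tuple tree representation, and the use of the simple p-distance (proportion of differing sites) are all choices made here for illustration. UPGMA is shown because it is the simplest clustering scheme named in the text; the sketch returns a topology only and, like UPGMA in general, implicitly assumes clock-like evolution.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites (ignoring gap positions) at which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise distances keyed by the unordered pair of sequence names."""
    return {
        frozenset((i, j)): p_distance(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


def upgma(alignment):
    """Cluster sequences by UPGMA and return the topology as nested tuples."""
    dist = distance_matrix(alignment)
    size = {name: 1 for name in alignment}  # number of leaves in each cluster
    while len(size) > 1:
        # Pick the closest pair of clusters and merge them.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair, key=str)
        merged = (a, b)
        merged_size = size[a] + size[b]
        # Distance to every other cluster is the size-weighted average.
        for c in size:
            if c in (a, b):
                continue
            dist[frozenset((merged, c))] = (
                dist[frozenset((a, c))] * size[a]
                + dist[frozenset((b, c))] * size[b]
            ) / merged_size
        # Drop all entries that still refer to the two merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        del size[a], size[b]
        size[merged] = merged_size
    return next(iter(size))


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTA"}
print(upgma(toy))
```

For the toy alignment, the two most similar pairs (A with B, C with D) are merged first and then joined at the root. Real analyses would normally apply a substitution model to correct the raw distances before clustering.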
Due to the calculation time involved, assessing all possible phylogenies is found to be impractical for data sets including more than 10 sequences; for larger data sets, different heuristic approximations are used that may not guarantee a recovered phylogeny to be the best overall. There are two major criteria for selecting the best phylogeny using character-state information: maximum parsimony (MP) and maximum likelihood (ML). In MP analysis, a phylogeny with a minimal number of substitutions separating the analyzed species is sought. The ML analysis offers a statistical framework for comparing how well different trees fit the data under competing models of evolution, whose parameters include population size change and rate of mutation, in search of the tree with the best fit. The latter approach is mathematically robust and its statistical power may also be used in combination with other techniques of tree generation. Recently, a Bayesian variant of the ML approach has gained popularity. It can utilize prior knowledge about the evolutionary process, such as known substitution rates, clustering of species subsets, or dates of species isolation, in combination with repeated sampling from subsequently derived hypotheses. The result of a Bayesian analysis is thus a forest of trees that reflects the uncertainty associated with the reconstructed phylogeny and forms the basis for deriving a consensus tree and statistical support for its branches. In phylogenetic analysis of viruses, the dates of species isolation are often used to date the MRCA of the analyzed viruses under a Bayesian framework, while fossil information is routinely used to time-calibrate trees of cellular organisms.
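Two points from this paragraph lend themselves to a short Python sketch: why exhaustive tree search quickly becomes impractical, and how the MP criterion scores a candidate tree. The count of unrooted bifurcating trees for n sequences, (2n-5)!!, is a standard combinatorial result. The scoring uses Fitch's algorithm, which is one common way to obtain the minimum number of substitutions on a fixed tree and is an assumption here, not something the text prescribes. The toy alignment, the function names, and the nested-tuple trees are likewise illustrative.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n leaves: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


def fitch_site(tree, states):
    """Return (possible ancestral states, substitutions) for one alignment column."""
    if isinstance(tree, str):  # a leaf: its observed state, no change needed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra substitution
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column Fitch counts over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTA"}
print(count_unrooted_trees(10))                         # 2027025
print(parsimony_score((("A", "B"), ("C", "D")), toy))   # 4
print(parsimony_score((("A", "C"), ("B", "D")), toy))   # 6
```

Under the MP criterion the first topology is preferred because it requires fewer substitutions. With ten sequences there are already more than two million unrooted topologies to score, which is why heuristic searches are used for larger data sets.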
Bayesian methods have the highest computational cost due to their sampling approach and are thus the slowest, while implementations of the similarly advanced ML algorithm may be largely comparable in speed to distance methods, allowing for the phylogenetic analysis of very large data sets such as genome-wide tree reconstructions of cellular organisms or thousands of viruses. One should keep in mind that different methods for phylogeny reconstruction can produce different trees, concerning both topology and branch lengths, for the same data set, although better agreement between ML and Bayesian trees is common, especially with respect to branch lengths (Fig. 4). None of the methods is considered superior to the others with respect to all aspects of phylogeny reconstruction, and which method to use under what circumstances is often a point of debate. A valid approach to gain further confidence in phylogenetic results is to apply several methods to the data and to trust only HTUs that are inferred by more than one method. After a tree is chosen, it is common to assign support values to internal nodes by assessing the nodes' persistence in trees related to the chosen tree. One particular technique, called bootstrap analysis, in which trees are generated for numerous randomly modified derivatives of the original data set, is most frequently used in distance-based as well as MP and ML analysis. Each internal node in the original tree is characterized by a so-called bootstrap value that is equal to the number of times the node appears among all tested trees. Although the relationship between bootstrap and statistical values is not linear, nodes with very high bootstrap values are considered to be reliable. In a Bayesian analysis, the support of internal nodes is quantified through posterior probability values.
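The bootstrap procedure described above can be sketched as follows. This is a hedged illustration, assuming that the "randomly modified derivatives" are produced by resampling alignment columns with replacement (the usual nonparametric bootstrap) and that trees are rooted nested tuples whose internal nodes are identified by the set of leaves below them. The `build_tree` argument is a hypothetical placeholder for any tree-building function; all names are illustrative. Support is reported here as a fraction of replicates instead of a raw count.

```python
import random


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to build one pseudo-replicate."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def clades(tree):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def bootstrap_support(alignment, build_tree, replicates=100, seed=0):
    """Fraction of replicate trees in which each node of the original tree recurs."""
    rng = random.Random(seed)
    reference = clades(build_tree(alignment))
    counts = dict.fromkeys(reference, 0)
    for _ in range(replicates):
        replicate_tree = build_tree(resample_columns(alignment, rng))
        for clade in reference & clades(replicate_tree):
            counts[clade] += 1
    return {clade: n / replicates for clade, n in counts.items()}
```

The root node, which covers all leaves, trivially receives full support in this rooted representation. For unrooted trees, the same idea is usually applied to bipartitions of the leaf set instead of clades.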
If species evolve according to a molecular clock model, the root position in a tree could be calculated directly from the observed inter-species differences as a midpoint of cumulative inter-species differences. Alternatively, the root position may be assigned to a tree from knowledge about the analyzed species that was gained independently of phylogenetic analysis. Commonly, this knowledge comes in the form of one or more species that are assumed (or known) to have emerged before the 'birth' of the analyzed cluster. These early diverged species are collectively defined as the 'outgroup', while the analyzed species may be called the 'ingroup' (Fig. 3B, C). Also, a tree may be generated unrooted, a common practice in phylogenetic analysis of viruses, for which the applicability of the molecular clock model remains largely untested and reliable outgroups may not be routinely available (Fig. 4). In an unrooted tree, grouping of species in separate clusters may be apparent, although these clusters may not be treated as monophyletic as long as the direction of evolution has not been defined. These challenges are addressed by the development of new approaches that infer rooted trees without artificially restricting species evolution to a constant rate (known as relaxed molecular clock models).
Virus phylogeny can be inferred using either genomes or distinct genes and each of these approaches, standard in phylogenomics, may be considered as complementary. Under the first approach, genome-wide alignments are used for analysis. Due to complexities of the evolutionary process that may be region specific, reliable genome-wide alignments can routinely be built only for relatively closely related viruses whose analysis, however, may be further complicated by recombination events (see below).
Using the second approach, genes with no evidence for recombination may be merged (concatenated) in a single data set that may be used to produce a superior phylogenetic signal compared to those generated for distinct genes or entire genomes. For viruses with small genomes or for a diverse set of viruses, it is common practice to use a single gene to infer virus phylogeny. Although the results produced may be the best models describing evolutionary history of a group of viruses, the validity of this gene-based approach for the genome-wide extrapolation remains a point of debate. Recently, network methods were used to infer and depict evolutionary relationships of multigene virus genomes taking into account gene-specific sequence affinities. When the gene tree is used as representing the phylogeny of the entire genome, an underlying most common assumption is that its topology but not branch lengths holds for different genomic regions in reflection of their coevolution with potentially different rates of substitution. This assumption may be violated due to several evolutionary processes, including orthologous gene exchange between (closely) related viruses, gene duplication and horizontal gene transfer (HGT), all involving one or another form of recombination, or incomplete lineage sorting. In phylogenetic terms, this violation may be revealed through incongruency of trees built for different genome regions (Fig. 3F) . Trees may also become incongruent due to various technical reasons related to the size and diversity of a virus data set. These characteristics complicate interpretation of the congruency test, which is widely used in different programs to identify recombination in viruses. Other pitfalls of phylogenetic reconstruction include the inability to resolve basal branching patterns of highly divergent lineages (Fig. 
3D) and the relatively close clustering of lineages that are only distantly related and do not form a monophyletic group in the true (unknown) phylogeny (Fig. 3E). The latter phenomenon is known as long branch attraction (LBA) and the phylogenetic artifacts produced by LBA are most frequently observed for isolated, that is, long branches in the tree which represent distant lineages with no close relatives known.
Fig. 4 (caption fragment): LG amino acid substitution model with site heterogeneity modeled by a gamma distribution with four categories, as selected by ProtTest, was used. In the Bayesian analysis, a relaxed molecular clock approach with log-normally distributed rate was applied. The trees are drawn to the same scale of average amino acid substitutions per site, as indicated by the bar in the middle. Note the considerably shorter branch lengths of the NJ tree compared to the other two trees. Robinson-Foulds distances measuring the topological differences between tree pairs are shown in gray.
Phylogenetic analysis is used in a wide range of studies to address both applied and fundamental issues of virus research, including epidemiology, diagnostics, forensic studies, phylogeography, origin, evolution, and taxonomy of viruses. The first questions to be answered during an outbreak of a virus epidemic concern the virus identity and origin. Answers to these questions form the basis for implementing immediate practical measures and prospective planning, enabling specific and rapid virus detection and epidemic containment, which may include the use and development of antiviral drugs and vaccines. Among different analyses performed for virus identification at the early stage of a virus epidemic, the phylogenetic characterization is used for determining the relationship of a newly identified virus with all other previously characterized and sequenced viruses.
Results of this analysis may be sufficient to provide answers to the questions posed, as regularly happens with closely monitored viruses that include most human viruses of high social impact, for example, influenza virus, human immunodeficiency virus (HIV), hepatitis C virus (HCV), poliovirus, and others. For these viruses, there exist large databases of previously characterized isolates and strains that comprehensively cover the so far characterized natural diversity. Should a newly identified virus belong to one of these species, chances are that it has evolved from a previously sampled isolate or a close variant and this immediately becomes evident in the clustering of these viruses in the phylogenetic tree. Combining the results of gene-specific and genome-wide phylogenetic analysis allows one to determine whether recombination contributed to the isolate origin. For instance, recombination was found to be extremely uncommon in the evolution of HCV, but not for poliovirus lineages that recombine promiscuously, also with closely related human coxsackie A viruses, both of which belong to the same virus species of human enteroviruses known as Enterovirus C. When an emerging infection is caused by a new never-before-detected virus, the phylogenetic analysis is instrumental for classification of this virus and in the case of a zoonotic infection, for determining the dynamic of virus introduction into the (human) population and initiating the search for the natural virus reservoir. This was the case with many emerging infections including those caused by Nipah virus, a paramyxovirus, SARS coronavirus (SARS-CoV), MERS coronavirus (MERS-CoV), ebolavirus, and Zika virus. 
In the case of SARS-CoV, poor sampling of the coronavirus diversity in the lineage at the time, some uncertainty over the relationship between phylogeny and taxonomy of coronaviruses, and the complexity of phylogenetic analysis of a virus data set including isolated distant lineages led to considerable controversy over the exact evolutionary position of SARS-CoV among coronaviruses. Since then, the matter has fully been resolved but this experience illustrates some challenges in inferring virus phylogeny. The search for a zoonotic reservoir of an emerging virus may involve a significant and time-consuming effort that requires numerous phylogenetic analyses of ever-expanding sampling of the virus diversity generated in pursuit of the goal. In this quest, phylogenetic analysis canalizes the effort and provides crucial information for reconstructing parameters of major evolutionary events that promoted the virus origin and spread. For instance, intertwining HIV and simian immunodeficiency virus (SIV) lineages in the primate lentivirus tree led to the postulation that the existing diversity of HIV in the human population originated from several ancestral viruses independently introduced from primates over a number of years. Similar phylogenetic reasoning was used to trace the origin of a local HIV outbreak to a common source of HIV introduction through dental practice (known as 'HIV dentist' case). These are typical examples illustrating the utility of phylogenetic analysis for epidemiological and forensic studies. Geographic distribution of places of virus isolation is another important characteristic relative to which virus phylogeny may be evaluated. This field of study belongs to phylogeography. The evolution of human JC polyomavirus provides an example of confinement of circulation of virus clusters to geographically isolated areas, represented by three continents. 
Identification of West Nile virus in the USA illustrates a geographical expansion of an Old World virus into the New World. Analysis of phylogenies of field isolates of rabies virus of the family Rhabdoviridae sampled from different animals across Europe led to the recognition that interspecies virus expansion occurs faster than geographical expansion. Phylogenies can also reveal information about the relative strength of the virus-host association over time. In some virus families (e.g., the Coronaviridae) host-jumping events may be relatively frequent in establishing new species, including the emergence of at least three human viruses: the dead-end SARS-CoV and MERS-CoV and the successfully circulating human coronavirus OC43 (HCoV-OC43). At the other end of the spectrum one finds the family Herpesviridae. Extensive phylogenetic analysis of herpesviruses and their hosts showed a remarkable congruency of topologies of trees indicating that this virus family may have emerged some 400 million years ago and that herpesviruses largely cospeciate with their hosts. Moreover, through phylogenetic analysis one can show that most viruses, and in particular RNA viruses, evolve at rates that are orders of magnitude faster than those of cellular organisms. For instance, even the most conserved enzymes encoded by nidoviruses, which comprise just four RNA virus families, accumulated more than twice as many substitutions during evolution as their counterparts across the Tree of Life (ToL), as estimated through branch lengths of the respective phylogenetic trees (Fig. 1). Taking into account that the MRCA of all cellular organisms predates that of nidoviruses, this reveals that most residues of viral proteins changed repeatedly and more frequently than cellular protein residues during long-term evolution.
In fact, this high evolutionary rate seems to be a prerequisite for RNA viruses to stay fit in the ever-changing environment, considering their tiny genomes, which would otherwise not be able to produce enough genetic variation.
Phylogenetic analysis is becoming increasingly important in virus classification (taxonomy), whose development relies on complex multicharacter rules applied to separate virus families by respective 'study groups'. For viruses united in high-rank taxa above the genus level, phylogenetic clustering for most conserved replicative genes is commonly observed and used in the decision-making process. For instance, human hepatitis E virus, originally classified as a calicivirus using largely virion properties, was eventually expelled from the family due to poor fit of genome characteristics, including results of phylogenetic analysis. Phylogenetic considerations also played an important role in establishing new families, for example, the Marnaviridae and Dicistroviridae. In contrast, phylogenetic analysis has been of relatively little use in the taxonomy of large DNA phages, which has been developed in such a way that existing families may unite phages with different gene layouts and phylogenies. The relationship between phylogeny and taxonomy is evolving, and efforts have been made to extract taxa structure from monophyletic clusters in trees using analysis of pairwise evolutionary distances. In the future, one might hope for important advancements of virus taxonomy that improve cross-family consistency in relation to phylogeny.
AEG research was partially supported by Leiden University Fund and EU Horizon2020 project EVAg 653316.