title: Estimating Phylogenies from Molecular Data
author: Daniele Catanzaro
date: 2010-09-21
journal: Mathematical Approaches to Polymer Sequence Analysis and Related Problems
DOI: 10.1007/978-1-4419-6800-5_8

Phylogenetic estimation from aligned DNA, RNA or amino acid sequences has attracted more and more attention in recent years due to its importance in the analysis of many fine-scale genetic data. Nowadays, its application fields range from medical research to drug discovery, epidemiology, systematics and population dynamics. Estimating phylogenies involves solving an optimization problem, called the phylogenetic estimation problem (PEP), whose versions depend on the criterion used to select a phylogeny from among plausible alternatives. This chapter offers an overview of PEP and discusses the most important versions that occur in the literature.

Molecular phylogenetics studies the hierarchical evolutionary relationships among species, or taxa, by means of molecular data such as DNA, RNA, amino acid or codon sequences. These relationships are usually described through a weighted tree, called a phylogeny (Fig. 8.1), whose leaves represent the observed taxa, whose internal vertices represent the intermediate ancestors, whose edges represent the estimated evolutionary relationships, and whose edge weights represent measures of the similarity between pairs of taxa. Phylogenies provide fundamental information for the analysis of many fine-scale genetic data; for this reason, the use of molecular phylogenetics has become more and more frequent (and sometimes indispensable) in several research fields such as systematics, medical research, drug discovery, epidemiology and population dynamics [56]. For example, the use of molecular phylogenetics was of considerable assistance in predicting the evolution of human influenza A [8], understanding the relationships between the virulence and the genetic evolution of HIV [55, 66], identifying emerging viruses such as SARS [51], recreating and investigating ancestral proteins [17], designing neuropeptides causing smooth muscle contraction [2], and relating geographic patterns to macroevolutionary processes [36].

Since no one can observe evolution over thousands or millions of years, apart from known phylogenies [57], there is no general way to validate empirically a candidate phylogeny for a set of molecular sequences extracted from taxa. For this reason, the literature proposes a number of criteria for selecting one phylogeny from among plausible alternatives. Each criterion adopts its own set of evolutionary hypotheses, whose ability to describe the evolution of taxa determines the gap between the real and the true phylogeny, i.e., the gap between the real evolutionary process of taxa and the phylogeny that one would obtain under the same set of hypotheses if all molecular data from taxa were available [9]. The criteria of phylogenetic estimation can usually be quantified and expressed in terms of objective functions, giving rise to families of optimization problems whose general paradigm can be stated as follows:

$$\operatorname{opt}_{T \in \mathcal{T}}\ f(T) \quad \text{s.t.}\ \ g(T),$$

where $\Gamma$ is the set of molecular sequences from $n$ taxa, $T$ a phylogeny of $\Gamma$, $\mathcal{T}$ the set of $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdot 7 \cdots (2n-5)$ phylogenies of $\Gamma$, $f : \mathcal{T} \to \mathbb{R}$ a function modeling the selected criterion of phylogenetic estimation, and $g : \mathcal{T} \to \mathbb{R}$ a function correlating the set $\Gamma$ to a phylogeny $T$.
A specific optimization problem, or phylogenetic estimation paradigm, is completely characterized by defining the functions $f$ and $g$. The phylogeny $T$ that optimizes $f$ and satisfies $g$ is referred to as optimal, and if $T$ approaches the true phylogeny as the amount of molecular data from taxa increases, the corresponding criterion is said to be statistically consistent [32]. Statistical consistency is a desirable property in molecular phylogenetics because it measures the ability of a criterion to recover the true (and hopefully the real) phylogeny of the given molecular data. Later in this chapter, we will show that the consistency property changes from criterion to criterion and in some cases may even be absent.

Here, we provide a review of the main estimation criteria that occur in the literature on molecular phylogenetics. Particular emphasis is given to the comparative description of the hypotheses at the core of each criterion and to the optimization aspects related to the phylogenetic estimation paradigms. In Sect. 8.2, we discuss the problem of measuring the similarity among molecular sequences. In Sect. 8.3, we discuss the fundamental least-squares paradigm and formalize the concept of phylogeny. In Sect. 8.4, we present the minimum evolution paradigm, highlighting recent perspectives and computational advances. Finally, in Sect. 8.5, we present the likelihood and the Bayesian paradigms, briefly exposing their benefits and drawbacks.

The degree of similarity between pairwise molecular sequences reflects the amount of mutation events that occurred since they split from their common ancestor. Quantifying such similarity constitutes the first step in the phylogenetic estimation process [11]. The task involves the investigation and the modeling of the mutation process over time, i.e., the process by which errors occur in molecular data and are inherited between generations. Different types of mutation may occur in the genome structure, most of which are point mutations, i.e., changes that involve the replacement, or substitution, of one nucleotide for another in the DNA sequence. Point mutations can be classified into two categories: transitions and transversions. Transitions occur when a purine nucleotide (adenine or guanine) is substituted for another purine, or when a pyrimidine (cytosine or thymine) is substituted for another pyrimidine. Transversions occur when a pyrimidine is substituted for a purine, or vice versa. A second class of point mutations comprises those that lead to insertions and deletions of nucleotides in the genome. This phenomenon mainly occurs in non-coding regions of DNA, but may also affect coding regions of the genome and be the cause of deleterious effects [57]. Finally, a third class of mutations comprises those that involve entire chromosome regions of the genome. Specifically, we may have: (1) a duplication, when a chromosome region is duplicated; (2) a translocation, when a chromosome region is transferred into another chromosome; (3) an inversion, when a chromosome region is broken off, turned upside down and reconnected; (4) a deletion, when a chromosome region is missing or deleted; and (5) a loss of heterozygosity, e.g., when two instances of the same chromosome break and then reconnect, but to the wrong end pieces [57]. Modeling the second and the third classes of mutations is generally non-trivial and requires an advanced mathematical background.
We refer the interested reader to Felsenstein [29] for an introduction and to Park and Deem [58] for recent advances in the modeling of such classes. Here we shall focus on the first class of mutations and present a fundamental model of molecular evolution which is at the core of the most currently used criteria of phylogenetic estimation. Unless otherwise stated, throughout the chapter we will always assume that the molecular sequences under study have been previously subjected to an alignment process, i.e., a process through which the evolutionary relationships between the nucleotides of molecular data are made evident (see [60] for details).

Let $S$ be a DNA sequence, i.e., a string of fixed length over an alphabet $\Sigma = \{A, C, G, T\}$, where "A" codes for adenine, "C" for cytosine, "G" for guanine, and "T" for thymine. Let $r_{ij} \geq 0$, $i \neq j$, be the constant rate of substitution from nucleotide $i$ to nucleotide $j$. Assume that each character (site) of $S$ evolves independently over time and that, instant per instant, the Markov conservative hypothesis [39] holds, i.e.,

$$r_{ii} = -\sum_{j \in \Sigma : j \neq i} r_{ij} \quad \forall\, i \in \Sigma.$$

Let $p_{ij}(t)$ be the probability that nucleotide $i$ undergoes a substitution to nucleotide $j$ at finite time $t$. Then, if the superposition principle holds, at $t + dt$ such probability can be written as:

$$p_{ij}(t + dt) = p_{ij}(t)\,(1 + r_{jj}\,dt) + \sum_{k \in \Sigma : k \neq j} p_{ik}(t)\, r_{kj}\, dt. \tag{8.2}$$

By subtracting $p_{ij}(t)$ from both sides of (8.2) and dividing by $dt$, we obtain:

$$\frac{p_{ij}(t + dt) - p_{ij}(t)}{dt} = \sum_{k \in \Sigma} p_{ik}(t)\, r_{kj},$$

i.e.,

$$\frac{dP(t)}{dt} = P(t)\,R.$$

Hence, we have

$$\frac{dP(t)}{dt} = P(t)\,R, \quad P(0) = I, \tag{8.4}$$

which is known as the time homogeneous Markov (THM) model of DNA sequence evolution [48, 63]. The THM model is a generalization of the Markov models described in Jukes and Cantor [44], Kimura [46], Hasegawa et al. [37], and Tamura and Nei [78], and can be easily adapted to RNA, amino acid and codon sequences, as shown in Felsenstein [29] and Schadt and Lange [71, 72]. In the next section, we shall investigate the dynamics of the THM model in order to derive a commonly used formula to quantify the similarity between molecular data.

Two molecular sequences $S_1$ and $S_2$, evolving at time $t_0$ from a common ancestor, could be characterized at time $t$ by different amounts of substitution events, some of which are not directly observable. Hence, if we sampled the sequences at time $t$ and measured their similarity, or evolutionary distance, in terms of the number of observed differences, we could underestimate the overall substitution events that occurred since $S_1$ and $S_2$ split from their common ancestor. A number of authors suggested that the use of time homogeneous Markov models could overcome the underestimation problem in all those cases in which the hypotheses at the core of the model properly describe the real evolutionary process of the analyzed sequences [29]. Moreover, in order to compare the evolutionary distances of different pairs of molecular sequences, the authors also proposed to express the evolutionary distances in terms of the expected number of substitution events per site rather than the time necessary to transform one sequence into another [29]. In this section, we will present the most general formula currently known in the literature to compute the evolutionary distance from pairwise molecular sequences. To this aim, we shall now investigate the dynamics of the THM model. As shown in Zadeh and Desoer [84], (8.4) can also be expressed in closed formula as:

$$P(t) = \Omega\, e^{\Lambda t}\, \Omega^{-1},$$

where $\Omega$ is the eigenvector matrix of $R$, and $\Lambda$ is the diagonal matrix of the eigenvalues of $R$. This fact suggests that the spectrum of $P(t)$ is the exponential spectrum of $R$, i.e., the dynamics of $P(t)$ is uniquely determined by the knowledge of the spectrum of $R$ [84].
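As a concrete illustration of the THM dynamics, the following Python sketch computes $P(t) = e^{Rt}$ for a hypothetical rate matrix, both with a matrix exponential and through the spectral form $\Omega e^{\Lambda t} \Omega^{-1}$; all rate values below are invented for illustration and are not taken from the chapter.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical GTR-style rate matrix R over the alphabet (A, C, G, T).
# Off-diagonal entries are substitution rates r_ij >= 0; each diagonal
# entry is set so that rows sum to zero (the conservative hypothesis).
R = np.array([
    [0.0, 0.3, 0.4, 0.3],
    [0.3, 0.0, 0.3, 0.4],
    [0.4, 0.3, 0.0, 0.3],
    [0.3, 0.4, 0.3, 0.0],
])
np.fill_diagonal(R, -R.sum(axis=1))

t = 0.5  # elapsed time, in units of expected substitutions per site

# Closed form of the THM model: P(t) = exp(R t).
P = expm(R * t)

# Equivalent spectral form: P(t) = Omega exp(Lambda t) Omega^{-1},
# where Omega holds the eigenvectors of R and Lambda its eigenvalues.
lam, Omega = np.linalg.eig(R)
P_spectral = (Omega @ np.diag(np.exp(lam * t)) @ np.linalg.inv(Omega)).real

assert np.allclose(P, P_spectral)
assert np.allclose(P.sum(axis=1), 1.0)  # each row of P(t) is a distribution
print(P.round(4))
```

Note how each row of $P(t)$ sums to one, as required for a substitution probability matrix, and how the dynamics is indeed fully determined by the spectrum of $R$.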
It is worth noting that the Markov conservative hypothesis implies that the determinant of matrix $R$ is equal to zero, i.e., at least one of its eigenvalues is identically zero. Moreover, since any $k$-leading principal sub-matrix of $R$, $k < 4$, has negative determinant, by one of Sylvester's corollaries (see [6, p. 409]) all the remaining eigenvalues are negative. Thus, as the spectrum of $P(t)$ is the exponential spectrum of $R$, matrix $P(t)$ has at least one eigenvalue equal to 1, called the maximal Lyapunov exponent, and three eigenvalues lying in the interval $[0, 1)$. The maximal Lyapunov exponent prevents the presence of chaotic attractors and guarantees that, as $t$ goes to infinity, the generic entry $p_{ij}(t)$ is non-zero and independent of the starting state $i \in \Sigma$. In other words, the maximal Lyapunov exponent guarantees the existence of four positive values $\pi_A$, $\pi_C$, $\pi_G$, and $\pi_T$, called equilibrium frequencies, such that

$$\lim_{t \to \infty} p_{ij}(t) = \pi_j \quad \forall\, i, j \in \Sigma.$$

The values $\pi_j$ constitute a stationary distribution and turn out to be useful to measure the evolutionary distance between $S_1$ and $S_2$. In fact, denote $O(t)$ as a matrix whose generic entry $o_{ij}(t)$, $i, j \in \Sigma$, represents the probability that at a given site and time $t$, $S_1$ is characterized by nucleotide $i$ and $S_2$ by nucleotide $j$. Assume that $O(0) = \Pi$, where $\Pi$ denotes a diagonal matrix whose $j$th diagonal entry is $\pi_j$. Then it holds that:

$$O(t) = P'(t)\, \Pi\, P(t), \tag{8.6}$$

or equivalently

$$O(t) = e^{R' t}\, \Pi\, e^{R t},$$

where $P'(t)$ denotes the transpose of $P(t)$. Premultiplying both sides of (8.6) by $\Pi^{-1}$ we have

$$\Pi^{-1} O(t) = \Pi^{-1} e^{R' t}\, \Pi\, e^{R t}.$$

Since for any matrix function it holds that $f(A B A^{-1}) = A f(B) A^{-1}$, we have

$$\Pi^{-1} O(t) = e^{\Pi^{-1} R' t\, \Pi}\, e^{R t}. \tag{8.7}$$

If we assume that the hypothesis of time-reversibility holds, i.e.:

$$\pi_i\, r_{ij} = \pi_j\, r_{ji} \quad \forall\, i, j \in \Sigma,$$

then $\Pi^{-1} R' t\, \Pi$ and $R t$ are commutative, and (8.7) becomes:

$$\Pi^{-1} O(t) = e^{\Pi^{-1} R' t\, \Pi + R t}. \tag{8.8}$$

By applying the logarithmic matrix function to both members of (8.8) and premultiplying by $\Pi$, we obtain

$$R' t\, \Pi + \Pi\, R t = \Pi \log\big(\Pi^{-1} O(t)\big).$$

As the negative trace of $2t\,\Pi R$ represents the expected number of substitution events per site between $S_1$ and $S_2$, at time $t$ the evolutionary distance $d_{S_1, S_2}$ between $S_1$ and $S_2$ can be computed as:

$$d_{S_1, S_2} = -\operatorname{tr}\big(\Pi \log(\Pi^{-1} O(t))\big). \tag{8.9}$$

Equation (8.9) is known as the general time-reversible (GTR) distance [48, 63] and is the most general formula to quantify the similarity between molecular data using a time-reversible Markov model of molecular evolution. It is worth noting that, if on the one hand the hypothesis of time-reversibility simplifies the formalization of the evolutionary process of a pair of molecular sequences, on the other hand its introduction gives rise to important consequences. In fact, the hypothesis of time-reversibility implies that, if we compared two molecular sequences whose nucleotide frequencies are in equilibrium, the probability that a nucleotide $i$ undergoes a substitution to nucleotide $j$ would be equal to the probability that a nucleotide $j$ undergoes a substitution to nucleotide $i$. Thus, given a present-day molecular sequence and its ancestral sequence, it would be impossible to determine which sequence is the present one and which is the ancestral one. Hence, the hypothesis of time-reversibility removes temporality from the evolutionary process. We shall show in the next sections how the paradigms of phylogenetic estimation take advantage of this fact. Below, we summarize an example from [79] showing a possible application of (8.9). Consider the mitochondrial DNA sequences of human and chimpanzee shown in Horai et al. [40]. From the observed pairwise site-pattern frequencies one first computes the product $\Pi^{-1} O(t)$, then the product $\Pi \log(\Pi^{-1} O(t))$; the negative trace of the latter matrix gives the GTR distance between the two sequences (the numerical matrices of the original example are omitted).
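A minimal sketch of (8.9) follows, on two short hypothetical aligned sequences (invented for illustration; these are not the Horai et al. data): $O(t)$ is estimated from the observed joint site-pattern frequencies, symmetrized as time-reversibility demands, and the equilibrium frequencies from the average base composition.

```python
import numpy as np
from scipy.linalg import logm

ALPHABET = "ACGT"

def gtr_distance(seq1, seq2):
    """Evolutionary distance of Eq. (8.9): d = -trace(Pi log(Pi^{-1} O)).

    A minimal sketch: O is estimated from the observed joint frequencies
    of aligned site patterns (symmetrized, as time-reversibility demands),
    and the equilibrium frequencies Pi from the average base composition.
    """
    idx = {b: i for i, b in enumerate(ALPHABET)}
    O = np.zeros((4, 4))
    for a, b in zip(seq1, seq2):
        O[idx[a], idx[b]] += 1.0
    O = (O + O.T) / (2.0 * len(seq1))     # symmetrize and normalize
    pi = O.sum(axis=1)                    # empirical equilibrium frequencies
    Pi = np.diag(pi)
    return -np.trace(Pi @ logm(np.linalg.inv(Pi) @ O)).real

# Two hypothetical aligned sequences (not the Horai et al. data).
s1 = "ACGTACGTACGTAAGTACGTACGTACTTACGGACGT"
s2 = "ACGTACGAACGTACGTACCTACGTACGTACGTATGT"
print(round(gtr_distance(s1, s2), 4))
```

For weakly diverged sequences the resulting value slightly exceeds the raw proportion of observed differences, reflecting the correction for unobserved substitution events discussed above.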
A paradigm of phylogenetic estimation is a quantitative criterion used to discern a phylogeny from among plausible alternatives. One of the earliest paradigms was introduced by Cavalli-Sforza and Edwards [15] and is known as the additive model, or the least-squares model, of phylogenetic estimation [9]. Cavalli-Sforza and Edwards observed that, as molecular data provide the most detailed anatomy possible for any organism, the diversity of life on Earth must be reflected in them. Hence, if the evolution of a set of molecular data from taxa could be seen as a tree, then it could be described through a process that changes nucleotides over time. The trajectories described by such a process would split as taxa diverge, unite as taxa hybridize, end as taxa become extinct, and the living taxa would be represented by the intersection of the process and the "now" plane (Fig. 8.2). In general, we do not have a sampling of such a process over time but only the knowledge of the living taxa. Hence, in the absence of further information, one may be able only to reconstruct the projection of the process onto the "now" plane rather than the process itself. Note that although the evolutionary process over time is "directed," its projection is not (Fig. 8.2). Thus, when the projection is considered, the direction of evolution is definitely missed. Nevertheless, the projection of the evolutionary process still constitutes an important piece of information about the analyzed taxa; for this reason, Cavalli-Sforza and Edwards proposed a possible paradigm to recover it.

Fig. 8.2 An evolutionary process and its projection onto the "now" plane (from Cavalli-Sforza and Edwards [15])

The authors first considered the problem of how to represent formally a projection (phylogeny) of the evolutionary process. In order to mark the lack of a direction in evolution, the authors proposed to remove the root and the orientation of the edges of a phylogeny, and represented it as an unrooted binary tree, i.e., an undirected acyclic graph in which each internal vertex has degree three. The degree constraint does not necessarily have a biological foundation, but it helped the authors to formalize the evolutionary process. In fact, given $n$ taxa, the degree constraint implies that the number of edges in a phylogeny $T$ is $(2n - 3)$ and the number of internal vertices is $(n - 2)$. To prove the claim, note that as $T$ is a tree, it holds that:

$$|E_e(T)| + |E_i(T)| = |V_i(T)| + n - 1, \tag{8.10}$$

where $E_e(T)$ and $E_i(T)$ are the sets of external and internal edges of $T$, respectively. Moreover, since internal vertices have degree three, the following property holds:

$$|E_e(T)| + 2\,|E_i(T)| = 3\,|V_i(T)|. \tag{8.11}$$

Combining (8.10) and (8.11), and observing that $|E_e(T)| = n$, it follows that $|V_i(T)| = (n - 2)$ and $|E_i(T)| = (n - 3)$. Thus, a phylogeny $T \in \mathcal{T}$ can be seen as an unrooted binary tree in which the $n$ taxa are the $n$ leaves of $T$ and the common ancestors are internal vertices of degree three. It is worth noting that dealing with unrooted binary trees does not introduce oversimplifications, since it is easy to see that any $m$-ary tree can be transformed into a phylogeny by adding "dummy" vertices and edges (e.g., see Fig. 8.3). Cavalli-Sforza and Edwards encoded a phylogeny in $\mathcal{T}$ by means of an Edge-Path incidence matrix of a Tree (EPT) (see [53, p. 550]), i.e., a network matrix $X$ having a row for each path between two leaves and a column for each edge. The generic entry $x_{rs,e}$ of matrix $X$ is equal to 1 if edge $e$ belongs to the path $p_{rs}$ from leaf $r$ to leaf $s$, and 0 otherwise. As an example, Fig. 8.4b shows the EPT matrix corresponding to the phylogeny shown in Fig. 8.4a.
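To make the EPT encoding concrete, here is a minimal Python sketch on a hypothetical 4-taxon phylogeny (invented for illustration) that enumerates the leaf-to-leaf paths of a small unrooted binary tree and assembles the corresponding 0/1 matrix $X$:

```python
from collections import deque

# A hypothetical 4-taxon phylogeny with leaves 'a'..'d' and internal
# vertices 'u', 'v' (each internal vertex has degree three):
#   a - u,  b - u,  u - v,  c - v,  d - v   ->  2n - 3 = 5 edges
edges = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
leaves = ["a", "b", "c", "d"]

adj = {}
for x, y in edges:
    adj.setdefault(x, []).append(y)
    adj.setdefault(y, []).append(x)

def path(r, s):
    """Return the set of edges on the unique r-s path (BFS on the tree)."""
    parent, frontier = {r: None}, deque([r])
    while frontier:
        x = frontier.popleft()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                frontier.append(y)
    out, x = set(), s
    while parent[x] is not None:
        out.add(frozenset((x, parent[x])))
        x = parent[x]
    return out

# EPT matrix: one row per leaf pair, one column per edge.
pairs = [(r, s) for i, r in enumerate(leaves) for s in leaves[i + 1:]]
X = [[1 if frozenset(e) in path(r, s) else 0 for e in edges] for r, s in pairs]
for (r, s), row in zip(pairs, X):
    print(r, s, row)
```

The resulting matrix has $n(n-1)/2 = 6$ rows and $2n - 3 = 5$ columns, matching the counting argument above.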
Hence, the authors proposed a model in which each evolutionary distance $d_{rs}$, $r, s \in \Gamma$, among pairwise molecular data could be thought of as the resulting sum of the mutation events accumulated on the edges belonging to the path $p_{rs}$ linking taxa $r$ and $s$ in $X$. In other words, fixed a phylogeny $X$ and defined $w_e$ as the amount of mutation events on edge $e$, Cavalli-Sforza and Edwards asserted that:

$$d_{rs} = \sum_{e \in p_{rs}} x_{rs,e}\, w_e \quad \forall\, r, s \in \Gamma, \quad \text{i.e.,} \quad X w = \delta, \tag{8.12}$$

where $w = \{w_e\}$ is the edge weight vector associated with $X$, and $\delta$ is an $n(n-1)/2$ vector whose components are obtained by taking, row by row, the entries of the strictly upper triangular matrix $D = \{d_{rs}\}$. In general, for a fixed matrix $X$, (8.12) may not admit solutions; for this reason, the authors proposed the use of ordinary least-squares (OLS) to find the entries of vector $w$. Specifically, the authors suggested that the values $\delta_{rs} = \sum_{e \in p_{rs}} x_{rs,e}\, w_e$ should minimize the function

$$\sum_{r, s \in \Gamma : r < s} (d_{rs} - \delta_{rs})^2,$$

i.e., minimize the quadratic error related to the approximation of the evolutionary process by its projection. This condition holds when

$$w = X^{+} \delta,$$

where $X^{+}$ is the Moore-Penrose pseudo-inverse matrix of $X$. Thus, Cavalli-Sforza and Edwards' paradigm of phylogenetic estimation may be stated in terms of the following NP-hard convex optimization problem [22]:

Problem 8.2 (OLSP).
$$\min_{X \in \mathcal{X}}\ \big\| X X^{+} \delta - \delta \big\|_2^2,$$

where $\mathcal{X}$ denotes the set of all possible EPT matrices coding phylogenies. We refer the reader interested in a mathematical description of the necessary and sufficient conditions that characterize the set $\mathcal{X}$ to [14].

A number of authors proposed some modifications to Cavalli-Sforza and Edwards' model. Specifically, Fitch and Margoliash [31] observed that the OLSP implicitly considers the evolutionary distances $d_{rs}$ among pairwise molecular data as uniformly distributed independent random variables, a hypothesis that cannot be considered generally true due to the common evolutionary history of the analyzed taxa and the presence of sampling errors in molecular data. Hence, Fitch and Margoliash proposed to modify Cavalli-Sforza and Edwards' paradigm by introducing quantities $\omega_{rs}$ representing the variances of $d_{rs}$. They set $\omega_{rs} = 1 / d_{rs}^2$, $r, s \in \Gamma$, and stated the following paradigm (WLSP):

$$\min_{X \in \mathcal{X},\, w}\ \sum_{r, s \in \Gamma : r < s} \omega_{rs} \Big( d_{rs} - \sum_{e \in p_{rs}} x_{rs,e}\, w_e \Big)^2.$$

Later, Chakraborty [16] and Hasegawa et al. [38] proposed a very similar paradigm, called the generalized least-squares problem (GLSP), in which the variances $\omega_{rs}$ are replaced by the covariances of $d_{rs}$. Nowadays, the GLSP has fallen into disuse due to its statistical inconsistency problems [9].

Although the least-squares paradigm is a milestone in molecular phylogenetics, it is characterized by a number of drawbacks. For example, Cavalli-Sforza and Edwards' paradigm returns a tree metric, i.e., a phylogeny whose edge weights are non-negative [73, 80], whenever the distance matrix $D$ satisfies the ultrametric property

$$d_{rs} \leq \max\{d_{rq},\, d_{qs}\} \quad \forall\, r, s, q \in \Gamma : r \neq s \neq q,$$

or the additive property

$$d_{rs} + d_{qp} \leq \max\{d_{rq} + d_{sp},\ d_{rp} + d_{sq}\} \quad \forall\, r, s, q, p \in \Gamma.$$

Specifically, when $D$ is ultrametric or additive, the solution of Problem 8.2 is unique and obtainable in polynomial time through the UPGMA greedy algorithm [74] or the sequential algorithm [80], respectively. Unfortunately, when $D$ is generic (e.g., when it is obtained by means of the THM model, see Sect. 8.2), the least-squares paradigm may lead to the occurrence of negative entries in the vector $w$, i.e., to a phylogeny that is not a tree metric [32, 47]. Negative edge weights are infeasible both from a conceptual point of view (a distance, being an expected number of mutation events over time, cannot be negative [45]) and from a biological point of view (evolution cannot proceed backwards [57, 77]).
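The OLS fit $w = X^{+}\delta$ can be reproduced in a few lines; the sketch below reuses the hypothetical 4-taxon EPT matrix from the previous example together with an invented distance vector $\delta$, and illustrates that nothing in the fit prevents negative edge weights:

```python
import numpy as np

# EPT matrix of the hypothetical 4-taxon phylogeny sketched above
# (rows: pairs ab, ac, ad, bc, bd, cd; columns: edges au, bu, uv, cv, dv).
X = np.array([
    [1, 1, 0, 0, 0],   # a-b
    [1, 0, 1, 1, 0],   # a-c
    [1, 0, 1, 0, 1],   # a-d
    [0, 1, 1, 1, 0],   # b-c
    [0, 1, 1, 0, 1],   # b-d
    [0, 0, 0, 1, 1],   # c-d
], dtype=float)

# A hypothetical (generic, noisy) evolutionary distance vector delta.
delta = np.array([0.30, 0.45, 0.55, 0.42, 0.52, 0.21])

# OLS fit: w = X^+ delta, with X^+ the Moore-Penrose pseudo-inverse.
w = np.linalg.pinv(X) @ delta
print(w.round(4))          # note: nothing prevents negative entries here
print((X @ w).round(4))    # fitted path lengths, to compare against delta
```

With ultrametric or additive input the fitted weights would all be non-negative; with a generic $\delta$, as here, negative entries may appear, which is precisely the drawback discussed above.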
For the latter reason at least, non-tree-metric phylogenies are generally not accepted in molecular phylogenetics [35]. In response, some authors investigated the consequences of adding, or guaranteeing, the positivity constraint on edge weights in the least-squares paradigm. Gascuel and Levy [33] observed that the presence of the positivity constraint transforms any least-squares model into a non-negative linear regression problem, which involves projecting the distance matrix $D$ onto the positive cone defined by the set of tree metrics (see also [5, p. 187]); a small code sketch of this non-negative regression idea, for a fixed topology, is given below. Thus, the authors designed an iterative polynomial time algorithm able to generate a sequence of least-squares projections of $D$ onto such a set until an additive distance matrix (and the corresponding phylogeny) is obtained. Farach et al. [26] proposed an alternative approach to impose the positivity constraint. Specifically, the authors proposed to find the minimal perturbation of the distance matrix $D$ that guarantees the satisfaction of the additive or the ultrametric property. Farach et al. [26] proposed the $L_1$- and $L_\infty$-norms to constrain the entries of $D$ to satisfy the additive (ultrametric) property, and proved that such a problem can be solved in polynomial time when $D$ is required to be ultrametric under the $L_\infty$-norm. By contrast, the authors proved that their approaches become hard when an ultrametric or an additive distance matrix is required under the $L_1$-norm. Finally, Barthélemy and Guénoche [3] and Makarenkov and Leclerc [50] proposed a Lagrangian relaxation of the positivity constraint to guarantee tree metrics. Both algorithms are iterative and apply to the OLSP and the WLSP, respectively. Specifically, starting from a leaf, the algorithms generate a phylogeny with a growing number of leaves by solving an optimization problem in which the best non-negative edge weights that minimize the OLSP (respectively, the WLSP) are found. Both algorithms run in polynomial time and are characterized by computational complexities of $O(n^4)$ and $O(n^5)$, respectively. A similar iterative approach, implemented in the program FITCH, was also proposed by Felsenstein [27].

A second, and possibly more serious, drawback of the least-squares paradigm is the statistical inconsistency of some of its versions. Specifically, apart from the OLSP, which proves to be statistically consistent [23, 68], the only case in which the WLSP is known to be consistent is when the variances $\omega_{rs}$ are set to the inverse of the product of two strictly positive constants $\alpha_i$ and $\alpha_j$. By contrast, the GLSP is generally inconsistent [35].

Kidd and Sgaramella-Zonta [45] and Beyer et al. [4] independently proposed an alternative paradigm known as the minimum evolution problem, or the minimum evolution paradigm, of phylogenetic estimation [9]. The minimum evolution paradigm arises from Cavalli-Sforza and Edwards' model but differs mainly in the way in which a phylogeny is chosen from among possible alternatives. In fact, the minimum evolution criterion states that, if the evolutionary distances $d_{rs}$ were unbiased estimates of the true evolutionary distances (i.e., the distances that one would obtain if all the molecular data from the analyzed taxa were available), then the true phylogeny would have an expected length shorter than that of any other possible phylogeny compatible with $D$. Hence, the minimum evolution paradigm aims at finding the phylogeny whose sum of edge weights, estimated from the corresponding evolutionary distances, is minimum [9].
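Before analyzing minimum evolution in detail, here is the non-negative regression sketch announced above. It only illustrates the constrained fit for a fixed topology, using an off-the-shelf non-negative least-squares solver; it is not an implementation of Gascuel and Levy's projection algorithm. Matrix and distances are the hypothetical ones used earlier.

```python
import numpy as np
from scipy.optimize import nnls

# EPT matrix of the hypothetical 4-taxon phylogeny and invented distances.
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
], dtype=float)
delta = np.array([0.30, 0.45, 0.55, 0.42, 0.52, 0.21])

w_ols = np.linalg.pinv(X) @ delta   # unconstrained OLS: may go negative
w_nn, residual = nnls(X, delta)     # non-negative regression: w >= 0
print(w_ols.round(4))
print(w_nn.round(4), round(residual, 4))
```

The non-negative fit trades a (slightly) larger residual for a tree metric, which is exactly the compromise the positivity-constrained approaches formalize.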
It is worth noting that the minimum evolution criterion does not assert that molecular evolution follows minimum paths, but states, in accordance with classical evolutionary theory, that a minimum length phylogeny may properly approximate the real phylogeny of well-conserved molecular data, i.e., data whose basic biochemical function has undergone small change throughout the evolution of the observed taxa [4]. That evolution proceeds by small rather than smallest changes is due to the fact that the neighborhood of possible alleles that can be selected at each instant of the life of a taxon is finite and, perhaps more importantly, that the selective forces acting on the taxon may not be constant throughout its evolution [4, 80]. Over the long term (periods of environmental change, including changes in the intracellular environment), small changes will not generally add up to the smallest change. Thus, a minimum length phylogeny provides a lower bound on the total number of mutation events that could have occurred along the evolution of the observed taxa.

Different versions of the minimum evolution paradigm are discussed in the literature on phylogenetics, and each one is characterized by its own edge weight estimation model [9]. Specifically, we can distinguish between the least-squares edge weight estimation model [24, 68, 69] and the linear programming edge weight estimation model [4, 14, 80]. In the next sections, we shall analyze both families in detail.

The earliest minimum evolution paradigm of phylogenetic estimation was proposed by Kidd and Sgaramella-Zonta [45], who suggested estimating the edge weights of each candidate phylogeny by least-squares and selecting the phylogeny of minimum length; this version is commonly referred to as the minimum evolution least-squares problem (MELSP). Rzhetsky and Nei [68, 69] observed that the MELSP is statistically consistent, and that such a property is also guaranteed when considering a relaxed version of the objective function in which edge weights are summed regardless of their sign. However, Swofford et al. [77] criticized the choice of taking into account negative edge weights (or even their absolute values) in the objective function due to their biological infeasibility. Thus, the authors proposed to replace the objective function with one in which negative edge weights do not contribute to the length of a phylogeny. Gascuel et al. [35] investigated the statistical consistency of Swofford et al.'s [77] paradigm and obtained results analogous to those of Rzhetsky and Nei [68, 69]. At present, Swofford et al.'s [77] paradigm is one of the most used versions of minimum evolution, being implemented in the well-known software for phylogenetic estimation "PAUP" [76]. The software is able to solve exactly instances of the paradigm containing up to 13 taxa and implements a hill-climbing metaheuristic to tackle larger instances of the problem.

Recently, Desper and Gascuel [24, 25] formalized the most recent version of the minimum evolution paradigm, called the balanced minimum evolution problem (BME). The paradigm is based on Pauplin's [59] seminal work, in which the author criticized the biological considerations at the core of the OLSP. In fact, Pauplin noted that, when computing the Moore-Penrose pseudo-inverse of the EPT matrix $X$, some edges can be weighted more than others. Since there is no biological justification for this, Pauplin proposed a new paradigm in which all edges of a phylogeny are weighted in the same way. The resulting objective function does not depend explicitly on edge weights and can be stated as follows:

$$L(T) = \sum_{r, s \in \Gamma : r < s} \frac{d_{rs}}{2^{\tau_{rs} - 1}},$$

where $\tau_{rs}$ is called the topological distance and denotes the number of edges belonging to the path between taxa $r$ and $s$ in a phylogeny $T$ [9].
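Before stating the problem formally, the following sketch evaluates Pauplin's balanced length on the hypothetical 4-taxon topology and distance vector used in the earlier examples:

```python
import numpy as np

# Topological distances tau_rs (number of edges on the r-s path) for the
# hypothetical 4-taxon phylogeny ((a,b),(c,d)) used earlier, and the
# invented distance vector d (same pair order: ab, ac, ad, bc, bd, cd).
pairs = ["ab", "ac", "ad", "bc", "bd", "cd"]
tau   = np.array([2, 3, 3, 3, 3, 2])
d     = np.array([0.30, 0.45, 0.55, 0.42, 0.52, 0.21])

# Pauplin's balanced length: each pair contributes d_rs / 2^(tau_rs - 1),
# so every edge of the phylogeny ends up weighted in the same way.
balanced_length = float(np.sum(d / 2.0 ** (tau - 1)))
print(round(balanced_length, 4))
```

Note that no edge weights appear: the balanced length depends only on the topology (through $\tau_{rs}$) and on the input distances, which is what makes the objective attractive from an optimization standpoint.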
Hence, BME can be stated in terms of the following optimization problem:

$$\min_{T \in \mathcal{T}}\ \sum_{r, s \in \Gamma : r < s} \frac{d_{rs}}{2^{\tau_{rs} - 1}}.$$

BME is known to be statistically consistent [24, 25], and its optimal solution satisfies the positivity constraint whenever the distance matrix satisfies the triangle inequality

$$d_{rs} \leq d_{rq} + d_{qs} \quad \forall\, r, s, q \in \Gamma : r \neq s \neq q.$$

For the latter reason at least, finding the optimal solution to instances of BME is highly desirable. Unfortunately, this task seems hard, although at present no information about the complexity of BME is known in the literature. Recent advances in the polyhedral combinatorics of BME have led to solving exactly instances containing up to 20-25 taxa [13]. However, the size of the instances analyzable to optimality is still far from real needs; for this reason, the use of clustering heuristics (Fig. 8.5), such as the neighbor-joining tree (NJT) [70, 75], is common to tackle large instances of BME. Possibly, future developments in the polyhedral combinatorics of BME will provide fundamental new insights for the development of more efficient exact approaches to the solution of the problem.

An alternative model to estimate edge weights in the minimum evolution paradigm is provided by linear programming. The model was introduced by Beyer et al. [4] and is based on the following motivation: if the evolutionary distances between pairs of molecular data have to reflect the number of mutation events required to convert one molecular sequence into another over time, then they must satisfy the triangle inequality. Moreover, since any edge weight of a phylogeny is de facto an evolutionary distance, the entries of vector $w$ must also satisfy the triangle inequality. This last observation imposes that, for each path $p_{rs}$ from taxon $r$ to taxon $s$ in $X$, the constraint $\sum_{e \in p_{rs}} w_e\, x_{rs,e} \geq d_{rs}$ be satisfied. Hence, Beyer et al. [4] proposed a possible paradigm of phylogenetic estimation consisting of solving the following mixed integer programming model (MELP):

$$\min_{X \in \mathcal{X},\, w}\ \sum_{e} w_e \quad \text{s.t.}\ \ X w \geq \delta,\ \ w \geq 0.$$

For a fixed topology $X$, the inner edge weight estimation reduces to a linear program; a small numerical sketch is given below. MELP is a well-known APX-hard problem [26] for which the current exact algorithms described in the literature provide solutions to instances containing not more than a dozen taxa [14]. To the best of our knowledge, nothing is known about the statistical consistency of MELP.

There are mainly two drawbacks that affect the minimum evolution paradigm of phylogenetic estimation: the "rigidity" of its criterion and the hardness of its paradigms. As regards the first drawback, some authors, among which notably Felsenstein [29, p. 175], argued that the minimum evolution paradigms could prove unreliable, as they neglect rate variation when estimating edge weights. This major criticism could possibly be overcome by using non-homogeneous Markov models. Specifically, in a non-homogeneous Markov model, the Chapman-Kolmogorov master equation becomes [84]:

$$\frac{\partial P(t_0, t)}{\partial t} = P(t_0, t)\, R(t), \quad P(t_0, t_0) = I, \tag{8.15}$$

whose solution can be approximated by subdividing the interval $[0, t]$ into $k$ subintervals and composing the corresponding homogeneous solutions,

$$P_k(0, t) = \prod_{h=1}^{k} e^{R(t_h)\, t / k}, \tag{8.16}$$

where $t_h$ is a point in the $h$th subinterval, since it is possible to prove that (8.16) converges to the matrix $P(0, t)$ when $k \to \infty$ [18]. Hence, under a non-homogeneous Markov model, the substitution probability matrix could be easily computed by means of iterative procedures that appropriately approximate (8.15). Concerning the second drawback, it is easy to realize that the NP-hardness of the minimum evolution paradigms constitutes a big handicap for the development of exact solution approaches of practical use. Exact approaches are necessary to guarantee the optimality of a given solution and fundamental to investigate whether the hypotheses at the core of a criterion are well suited to describe the evolutionary process of the observed taxa.
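As anticipated above, for a fixed topology the edge weight estimation step of MELP reduces to a linear program. The sketch below solves it with scipy's `linprog` on the hypothetical 4-taxon data used throughout; the full MELP, which also searches over topologies, is of course much harder.

```python
import numpy as np
from scipy.optimize import linprog

# For a FIXED topology (the hypothetical EPT matrix X used earlier), the
# edge weight estimation of Beyer et al. reduces to the linear program
#   min sum_e w_e   s.t.   X w >= delta,  w >= 0.
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
], dtype=float)
delta = np.array([0.30, 0.45, 0.55, 0.42, 0.52, 0.21])

res = linprog(c=np.ones(X.shape[1]),    # minimize total tree length
              A_ub=-X, b_ub=-delta,     # X w >= delta, rewritten as -X w <= -delta
              bounds=[(0, None)] * X.shape[1])
print(res.x.round(4), round(res.fun, 4))
```

The non-negativity of $w$ is built into the model here, in contrast with the least-squares fits, where it had to be imposed a posteriori.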
At present, most molecular datasets involve hundreds of taxa, whereas the current exact solution approaches have difficulty tackling instances containing more than two dozen taxa (even fewer for the linear programming paradigm). Increasing the size of the datasets analyzable to optimality is possibly one of the most challenging problems in molecular phylogenetics and surely warrants further research efforts.

One of the most used criteria of phylogenetic estimation is the likelihood criterion. First formalized by Felsenstein [28], the likelihood criterion states that, among the many plausible explanations of an observed phenomenon, the one having the highest probability of occurring should be preferred to the others. When the likelihood criterion is applied to phylogenetic estimation, a phylogeny is defined to be optimal (or the most likely) if it has the highest probability of explaining the observed taxa. Thus, the likelihood paradigm consists of finding the phylogeny that maximizes a stochastic function, called the likelihood function, modeling a set of evolutionary hypotheses about the observed taxa. The fundamental difference that distinguishes the likelihood paradigm from the least-squares and the minimum evolution paradigms is the nature of the information that it aims at finding. Specifically, while the least-squares and the minimum evolution paradigms aim at finding the best possible approximation of the projection of the evolutionary process of the observed taxa, the likelihood paradigm aims at reconstructing the most likely evolutionary process that originated the observed taxa. Hence, while the phylogeny of the least-squares and the minimum evolution paradigms is an unrooted binary tree, the phylogeny of the likelihood paradigm is a rooted phylogeny, i.e., a full binary tree having $(2n - 1)$ vertices.

Formally, the likelihood function is defined to be a recursive function of a fixed rooted phylogeny $T$, a model of molecular evolution $M$, and an observed data matrix $S = \{s_{rc}\}$, i.e., a matrix whose $r$th row represents the molecular sequence of the $r$th taxon. Define the quantity

$$L_r^c(i) = \begin{cases} 1 & \text{if } s_{rc} = i, \\ 0 & \text{otherwise}, \end{cases}$$

for each leaf $r$ of $T$, each column $c$ of $S$ and each $i \in \Sigma$; for each internal vertex $v_h$ of $T$ with children $v_1$ and $v_2$, define recursively the quantity

$$L_{v_h}^c(i) = \Big( \sum_{j \in \Sigma} p_{ij}(t_{v_h, v_1})\, L_{v_1}^c(j) \Big) \Big( \sum_{j \in \Sigma} p_{ij}(t_{v_h, v_2})\, L_{v_2}^c(j) \Big);$$

then the likelihood function is

$$L(T, S, M) = \prod_{c} \sum_{i \in \Sigma} \pi_i\, L_{\rho}^c(i),$$

where $\rho$ denotes the root of $T$. In the context of the likelihood paradigm, the expected numbers of substitutions per site $t_{v_h, v_k}$ assume a meaning analogous to that of edge weights in the least-squares and minimum evolution paradigms. Hence, when a given model of molecular evolution is assumed to hold (e.g., the THM model), finding the most likely phylogeny for a set of molecular sequences means maximizing the nonlinear, (usually) non-convex stochastic function $L(T, S, M)$ over all the possible rooted phylogenies and, for each rooted phylogeny, over all the possible associated edge weights $t_{v_h, v_k}$ and substitution probabilities $p_{ij}(t_{v_h, v_k})$. The NP-hardness of the likelihood paradigm [62] justified the development of a number of approximate solution approaches, typically based on hill-climbing strategies. Specifically, the strategies consist of a first phase in which the structure of a best-so-far phylogeny is modified and a second phase in which the nonlinear optimization of edge weights and substitution probabilities is performed. The two phases are iterated consecutively until a stopping criterion is satisfied (e.g., the number of iterations performed or the elapsed time) [7, 28, 64].
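The recursion above (Felsenstein's pruning algorithm) is easy to sketch for a single site; the tree, branch times and Jukes-Cantor-like rates below are invented for illustration:

```python
import numpy as np
from scipy.linalg import expm

# A minimal sketch of the likelihood recursion for one site on a
# hypothetical rooted tree rho -> (leaf 1, leaf 2), under a
# Jukes-Cantor-like model; all rates and times are made up.
R = np.full((4, 4), 1.0 / 3.0)
np.fill_diagonal(R, -1.0)
pi = np.full(4, 0.25)                     # equilibrium frequencies

def leaf_vector(state):
    """L_r(i) = 1 if the leaf shows nucleotide i at this site, else 0."""
    v = np.zeros(4)
    v[state] = 1.0
    return v

def internal_vector(t_left, L_left, t_right, L_right):
    """L_v(i) = (sum_j p_ij(t_l) L_l(j)) * (sum_j p_ij(t_r) L_r(j))."""
    return (expm(R * t_left) @ L_left) * (expm(R * t_right) @ L_right)

# Site pattern: leaf 1 shows A (index 0), leaf 2 shows G (index 2).
L_root = internal_vector(0.1, leaf_vector(0), 0.2, leaf_vector(2))
site_likelihood = pi @ L_root             # sum_i pi_i L_rho(i)
print(site_likelihood)
```

The likelihood of a whole alignment is the product of such per-site values over all columns; a real implementation would work in log space and reuse the transition matrices across sites.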
A systematic review of the hill-climbing strategies for the likelihood paradigm is outside the scope of the present chapter and can be found in Bryant et al. [7]. Recent mathematical advances on the likelihood paradigm have overcome several limitations of the initial Felsenstein model, such as the absence of rate variation among sites [81] and the absence of correlated evolution among sites [61]. Moreover, considerable progress has been made concerning the analysis of its statistical consistency and its identifiability, i.e., the study of the conditions under which the likelihood function is at least injective, an aspect markedly related to its consistency [7]. The reader may find it useful to refer to Gascuel [32] and Gascuel and Steel [34] for an overview of these aspects.

Given a dataset of molecular sequences, suppose we have sufficient empirical evidence to assert that the evolution of the observed taxa followed a specific stochastic process. Then, we could try to combine this a priori information with the likelihood function in order to bias the search for the most probable phylogeny toward those solutions that fit the known evolutionary process. This idea is at the core of the most recent likelihood-derived paradigm of phylogenetic estimation, called the Bayesian paradigm, which will be briefly described in this section. Similar to the likelihood paradigm, the Bayesian paradigm aims at finding the phylogeny that has the highest probability of recovering the evolutionary process of the observed taxa. However, the selection of the most probable phylogeny is performed in light of the a priori information. Specifically, the a priori information is usually modeled by means of peculiar probability distributions, called prior distributions, which mainly concern three parameters, namely: the topology, i.e., the structure of the phylogeny; the edge weights; and the substitution probabilities. Denoting by $\Omega$ the edge weight space and by $R$ the substitution probability space, the posterior probability of a phylogeny $T_i$ is obtained through Bayes' theorem as

$$B(T_i, S, M) = \frac{\int_{\Omega} \int_{R} L(T_i, S, M)\, \Pr(T_i)\, d\omega\, dr}{\sum_{T_j \in \mathcal{T}} \int_{\Omega} \int_{R} L(T_j, S, M)\, \Pr(T_j)\, d\omega\, dr}.$$

Hence, finding the optimal solution for the Bayesian paradigm means finding the phylogeny $T_i$, the associated edge weights and the substitution probabilities that globally maximize the posterior probability distribution of phylogenies $B(T, S, M)$. Since finding the maximum a posteriori phylogeny implicitly implies being able to solve the likelihood paradigm, solving the Bayesian paradigm is NP-hard [29]. The recursive nature of the likelihood function and the intractability of computing the denominator of Bayes' theorem prevent an analytical approach to the solution of the Bayesian paradigm. Hence, the maximum a posteriori phylogeny is usually computed by means of a Markov chain Monte Carlo (MCMC) algorithm [30], i.e., an algorithm that samples $B(T, S, M)$ through a stochastic generation of phylogenies in $\mathcal{T}$ [49, 52, 83]. Sampling $B(T, S, M)$ is extremely time consuming; therefore, Bayesian estimations may take even weeks [42]. However, as observed by Yang [82] and Huelsenbeck et al. [41, 43], the sampling process also has the indisputable benefit of providing a measure of the reliability of the best-so-far solution found. In fact, by sampling stochastically around the (best local) maximum a posteriori phylogeny $T$, the Bayesian paradigm can determine support values for the subtrees of $T$, i.e., measures of the posterior probability that the subtrees are true. The Bayesian paradigm is possibly the most complex among the phylogenetic estimation paradigms currently available in the literature on molecular phylogenetics.
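A toy sketch of the Metropolis step underlying such MCMC samplers is given below; the likelihood and prior are hypothetical stand-ins (a real sampler would evaluate $L(T, S, M)$ via the pruning recursion and propose topology changes as well as edge weight changes):

```python
import math
import random

# A minimal sketch of the Metropolis step used by MCMC samplers of the
# posterior B(T, S, M) ~ likelihood(T) * prior(T). The two functions
# below are hypothetical stand-ins for a real likelihood and prior.
def log_likelihood(tree):   # placeholder: would run Felsenstein's pruning
    return -tree["length"]

def log_prior(tree):        # placeholder: e.g., exponential prior on length
    return -0.1 * tree["length"]

def metropolis_step(current, propose):
    """Propose a neighboring phylogeny and accept it with probability
    min(1, posterior(candidate) / posterior(current))."""
    candidate = propose(current)
    log_ratio = (log_likelihood(candidate) + log_prior(candidate)
                 - log_likelihood(current) - log_prior(current))
    return candidate if math.log(random.random()) < log_ratio else current

state = {"length": 2.0}
for _ in range(1000):
    state = metropolis_step(
        state, lambda t: {"length": abs(t["length"] + random.gauss(0, 0.1))})
print(state)
```

Note that the intractable denominator of Bayes' theorem cancels in the acceptance ratio, which is precisely why MCMC is the method of choice here.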
The recent computational advances obtained by Ronquist and Huelsenbeck [65] sped up the execution of the MCMC algorithm and widened the use of the Bayesian paradigm. However, the lack of a systematic investigation of its statistical consistency and the unclear dependence of the posterior density function on the a priori information [82] possibly make the Bayesian paradigm still unripe for phylogenetic estimation [1].

The higher the complexity of a paradigm, the higher the number of drawbacks that can arise, and the likelihood and the Bayesian paradigms do not escape this rule. Specifically, a number of computational and theoretical drawbacks affect the two paradigms. The computational drawbacks mainly involve (i) the optimization aspects of the likelihood function and (ii) the sampling process in the Bayesian paradigm. The theoretical drawbacks concern the evolutionary hypotheses at the core of the likelihood and Bayesian criteria. As regards the computational drawbacks, in Sect. 8.5 we have seen that finding the most likely phylogeny for a set of taxa involves maximizing a nonlinear and generally non-convex stochastic function over all the possible phylogenies in $\mathcal{T}$ and, for each phylogeny, over all the possible edge weights and substitution probabilities. Notoriously, this task can only be performed in an approximate way, due to the lack of general mathematical conditions that guarantee the global optimality of a solution in nonlinear non-convex programming [21, 54]. Hence, although it is possible (at least for small datasets) to enumerate all the possible phylogenies in $\mathcal{T}$, it is not possible to globally optimize the edge weights and the substitution probabilities of a fixed phylogeny $T$. This fact may negatively affect the statistical consistency of the likelihood and the Bayesian paradigms. In fact, the number of local optima of the likelihood function grows exponentially with the number of taxa considered [7, 19, 20]. Thus, for a fixed phylogeny $T$, the global optimum of the likelihood function is generally approximated by means of hill-climbing techniques that jump from one local optimum to another until a stopping criterion is satisfied (e.g., the number of iterations performed or the elapsed time) [7, 28, 64]. Assume that two phylogenies $T_1$ and $T_2$ are given, and let $\theta_1$ and $\theta_2$ be two vectors whose entries are the edge weights and the substitution probabilities associated with $T_1$ and $T_2$, respectively. Let $z_1$ and $z_2$ be the likelihood values of $T_1$ and $T_2$ at $\theta_1$ and $\theta_2$, respectively, and assume, without loss of generality, that $z_1 > z_2$. Due to the local nature of the optima $\theta_1$ and $\theta_2$, there could exist another local optimum, say $\hat{\theta}_2$, such that $\hat{z}_2 > z_1 > z_2$. If the hill-climbing algorithm finds $\hat{\theta}_2$ before $\theta_2$, then we will consider $T_2$ a better phylogeny than $T_1$; otherwise, we will discard $T_2$ in favor of $T_1$. Hence, it is easy to realize that, if one of the two phylogenies is the true phylogeny, its acceptance is subordinated to the goodness of the hill-climbing algorithm used to optimize the likelihood function, and as a result the statistical consistency of the likelihood and Bayesian paradigms may be seriously compromised. Some authors argued that multiple local optima should arise infrequently in real datasets [64], but this conjecture was proved false by Bryant et al. [7] and Catanzaro et al. [12]. Specifically, Bryant et al.
[7] observed that changing the model of molecular evolution influences the presence of multiple optima in the likelihood function, and Catanzaro et al. [12] showed a number of real datasets affected by strong multimodality of the likelihood function. Despite the importance of the topic, to the best of our knowledge nobody has been able to propose a plausible solution to this critical aspect.

A second computational drawback concerns the sampling process of the Bayesian paradigm. In fact, as shown in Sect. 8.5.1, the approximation of the posterior density function is generally performed by means of an MCMC algorithm (e.g., the Metropolis or the Gibbs sampling algorithm [30]) that performs random walks in $\mathcal{T}$. The random walk should be sufficiently diversified to potentially sample the whole of $\mathcal{T}$ and to avoid double backs (i.e., sampling phylogenies already visited). Unfortunately, despite the recent computational advances in the Bayesian paradigm [65], no technique can guarantee a sufficient diversification of the sampling process. Hence, in practice the convergence to the maximum a posteriori phylogeny becomes the convergence to the best-so-far a posteriori phylogeny, which can be arbitrarily distinct from the true phylogeny (see [29, p. 296]).

As regards the theoretical drawbacks, it is worth noting that the evolutionary hypotheses at the core of the likelihood and Bayesian criteria of phylogenetic estimation are at the same time their strength and their weakness. For example, if a proposed model of molecular evolution matches (at least roughly) the real evolutionary process of a set of molecular data, then the likelihood and the Bayesian paradigms could succeed in recovering the real phylogeny of the corresponding set of taxa (provided a solution to their computational drawbacks). However, if this is not the case, the paradigms will just provide a (sub)optimal solution for that model that may completely mismatch the real phylogeny. This aspect becomes evident, e.g., in Rydin and Källersjö's [67] article where, for the same dataset, two different Markov models of molecular evolution are used and two different maximum posterior phylogenies are obtained, both having a 100% posterior probability of supporting the true phylogeny. This drawback concerns in general all the paradigms discussed in this chapter, and possibly there is no easy solution for it. Finally, a second theoretical drawback concerns the prior distributions of the Bayesian paradigm. In fact, it is worth noting that, if on the one hand a strength of the Bayesian paradigm is its ability to incorporate a priori information, on the other hand this information is rarely available; hence, in practical applications the prior distributions are generally modeled as uniform distributions, frustrating the potential strengths of the paradigm [1]. Moreover, it is unclear what type of information is well suited for a prior distribution, how possible conflicts among different sets of a priori information can be resolved, and whether the inclusion of prior distributions strongly biases the estimation process. Huelsenbeck et al. [43] vaguely claimed that "in a typical Bayesian analysis of phylogeny, the results are likely to be rather insensitive to the prior," but this result was not confirmed by Yang [82], who observed that "[...] the posterior probabilities of trees vary widely over simulated datasets [...] and can be unduly influenced by the prior [...]." Possibly, further research efforts are needed to provide answers to these practical concerns.
The success of a criterion of phylogenetic estimation is undoubtedly influenced by the quality of the evolutionary hypotheses at its core. If the hypotheses match (at least roughly) the real evolutionary process of a set of taxa, then the criterion will hopefully succeed in recovering the real phylogeny. Otherwise, the criterion will fail, suggesting an optimal phylogeny that partially or totally mismatches the correct result. Since we are far from a complete understanding of the complex facets of evolution, it is generally not possible to assess the superiority of one criterion over the others. Hence, families of estimation criteria cohabit in the literature on molecular phylogenetics, providing different perspectives on the evolutionary process of the involved taxa. In this chapter, we have presented a general introduction to the existing literature on molecular phylogenetics. Our purpose has been to introduce a classification scheme in order to provide a general framework for papers appearing in this area. In particular, three main criteria of phylogenetic estimation have been outlined: the first based on the least-squares paradigm, first proposed by Cavalli-Sforza and Edwards [15]; the second based on the minimum evolution paradigm, independently proposed by Kidd and Sgaramella-Zonta [45] and Beyer et al. [4]; and the third based on the likelihood paradigm, first proposed by Felsenstein [28]. This division has been further disaggregated into different, approximately homogeneous sub-areas, and the basic aspects of each have been pointed out. For each, the most relevant issues affecting their use in tackling real-world sized problems have also been outlined, as have the most interesting refinements deserving further research effort.
References

Bayesian inference of phylogeny: A nontechnical primer
Industrial applications of high-performance computing for phylogeny reconstruction
Trees and proximity representations
A molecular sequence metric and evolutionary trees
Numerical methods for least-squares problems
Optimization: Insights and applications
Likelihood calculation in molecular phylogenetics
Predicting the evolution of human influenza A
The minimum evolution problem: Overview and classification
Assessing the applicability of the GTR nucleotide substitution model through simulations
A non-linear optimization procedure to estimate distances and instantaneous substitution rate matrices under the GTR model
A very large-scale neighborhood search to estimate phylogenies under the maximum likelihood criterion
The balanced minimum evolution problem
Mathematical models to reconstruct phylogenetic trees under the minimum evolution criterion
Phylogenetic analysis: Models and estimation procedures
Estimation of time of divergence from phylogenetic studies
Recreating ancestral proteins
Sistemi Dinamici - Parte I
Multiple maxima of likelihood in phylogenetic trees: An analytic approach
Maximum likelihood Jukes-Cantor triplets: Analytic solutions
Trust-region methods
Computational complexity of inferring phylogenies from dissimilarity matrices
On the consistency of the minimum evolution principle of phylogenetic inference
Fast and accurate phylogeny reconstruction algorithms based on the minimum evolution principle
Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to the weighted least-squares tree fitting
A robust model for finding optimal evolutionary trees
An alternating least-squares approach to inferring phylogenies from pairwise distances
Evolutionary trees from DNA sequences: A maximum likelihood approach
Monte Carlo: Concepts, algorithms, and applications
Construction of phylogenetic trees
Mathematics of evolution and phylogeny
A reduction algorithm for approximating a (non-metric) dissimilarity by a tree distance
Reconstructing evolution
Strengths and limitations of the minimum evolution principle
New uses for new phylogenies
Evolutionary trees from DNA sequences: A maximum likelihood approach
Dating the human-ape splitting by a molecular clock of mitochondrial DNA
Stochastic models, volume 2 of Handbooks in operations research and management science
Man's place in the Hominoidea revealed by mitochondrial DNA genealogy
MrBayes: Bayesian inference of phylogenetic trees
Bayesian inference of phylogeny and its impact on evolutionary biology
Potential applications and pitfalls of Bayesian inference of phylogeny
Evolution of protein molecules
Phylogenetic analysis: Concepts and methods
A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences
A simulation comparison of phylogeny algorithms under equal and unequal rates
A new method for calculating evolutionary substitution rates
Phylogenetic tree construction using Markov chain Monte Carlo
An algorithm for the fitting of a tree metric according to a weighted least-squares criterion
Phylogenetic inference for binary data on dendograms using Markov chain Monte Carlo
Integer and combinatorial optimization
Molecular epidemiology of HIV transmission in a dental practice
The mathematics of phylogenomics
Molecular evolution: A phylogenetic approach
Phase diagrams of quasispecies theory with recombination and horizontal gene transfer
Direct calculation of a tree length using a distance matrix
Computational molecular biology
Coevolving protein residues: Maximum likelihood identification and relationship to structure
A short proof that phylogenetic tree reconstruction by maximum likelihood is hard
The general stochastic model of nucleotide substitution
Multiple local maxima for likelihoods of phylogenetic trees from nucleotide sequences
MrBayes 3: Bayesian phylogenetic inference under mixed models
Immune-mediated positive selection drives human immunodeficiency virus type 1 molecular variation and predicts disease duration
Taxon sampling and seed plant phylogeny
Theoretical foundations of the minimum evolution method of phylogenetic inference
Statistical properties of the ordinary least-squares, generalized least-squares and minimum evolution methods of phylogenetic inference
The neighbor-joining method: A new method for reconstructing phylogenetic trees
Codon and rate variation models in molecular phylogeny
Applications of codon and rate variation models in molecular phylogeny
A note on the neighbor-joining algorithm of Saitou and Nei
Phylogenetic inference
Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees
General time-reversible distances with unequal rates across sites: Mixing gamma and inverse gaussian distributions with invariant sites
Additive evolutionary trees
Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods
Bayesian inference in molecular phylogenetics
Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method
Linear system theory