key: cord-018459-isbc1r2o
authors: Munjal, Geetika; Hanmandlu, Madasu; Srivastava, Sangeet
title: Phylogenetics Algorithms and Applications
date: 2018-12-10
journal: Ambient Communications and Computer Systems
DOI: 10.1007/978-981-13-5934-7_17
sha: 
doc_id: 18459
cord_uid: isbc1r2o

Phylogenetics is a powerful approach in finding evolution of current day species. By studying phylogenetic trees, scientists gain a better understanding of how species have evolved while explaining the similarities and differences among species. The phylogenetic study can help in analysing the evolution and the similarities among diseases and viruses, and further help in prescribing their vaccines against them. This paper explores computational solutions for building phylogeny of species along with highlighting benefits of alignment-free methods of phylogenetics. The paper has also discussed the application of phylogenetic study in disease diagnosis and evolution.

Phylogenetics can be considered as one of the best tools for understanding the spread of contagious disease, for example, transmission of the human immunodeficiency virus (HIV) and the origin and subsequent evolution of the severe acute respiratory syndrome (SARS) associated coronavirus (SCoV) [1] . Earlier, morphological traits were used for assessing similarities between species and building phylogenetic trees. Presently, phylogenetics relies on information extracted from genetic material such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or protein sequences [2] . Methods used for phylogenetic inference have changed drastically during the past two decades: from alignment-based to alignment-free methods [3] . This paper has reviewed various methods under phylogenetic tree construction from character to distance methods and alignment-based to alignment-free methods. A brief review of phylogenetic tree applications is also given in cancer studies.

A phylogenetic tree can be unrooted or rooted, implying directions corresponding to evolutionary time, i.e. the species at the leaves of a tree relate to the current day species. The species can be expressed as DNA strings which are formed by combining four nucleotides A, T, C and G (A-adenine, T-thymine, C-cytosine and G-guanine). In literature, various string processing algorithms are reported which can quickly analyse these DNA and RNA sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. A high similarity among two sequences usually implies significant functional or structural likeliness, and these sequences are closely related in the phylogenetic tree. To get more precise information about the extent of similarity to some other sequence stored in a database, we must be able to compare sequences quickly with a set of sequences. For this, we need to perform the multiple sequence comparison. Dynamic programming concepts facilitate this comparison using alignment methods, but it involves more computation. Moreover, the iterative computational steps limit its utility for long length sequences [3] . Alignment-free methods overcome this limitation as they follow alternative metrics like word frequency or sequence entropy for finding similarity between sequences.

Phylogenetic tree generation consists of sequence alignment where the resulting tree reveals how alignment can influence the tree formation. Alignment-based methodologies are probably the most widely used tools in sequence analysis problems [4] . They consist of arranging two sequences: one on the top of another to highlight their common symbols and substrings. An alignment method is based on alignment parameters including insertion, deletions and gaps which play a pivotal role in the construction of the phylogenetic tree. A phylogenetic tree is formed as an outcome of sequence analysis performed on the DNA or RNA strings [5] . Sequence comparison reveals the patterns of shared history between species, helping in the prediction of ancestral states. The comparison of sequences also helps in understanding the biology of living organisms which is required to find similarity and relationship among species. For sequence comparison, we can follow alignment-based or alignment-free methods [3, 6, 7] . 

Sequence alignment is a method to identify homologous sequences. It is categorized as pairwise alignment in which only two sequences are compared at a time whereas in multiple sequence alignment more than two sequences are compared. Alignmentbased can be global or local [8, 9] . These alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. Figure 1 gives a hierarchical view of various methods for phylogenetic tree building.

The character-based methods compare all sequences simultaneously considering one character/site at a time. These are maximum parsimony and maximum likelihood. These methods use probability and consider variation in a set of sequences [10] . Both approaches consider the tree with the best score tree, which requires the smallest number of changes to perform alignment. Maximum parsimony method suffers badly from the long-branch attraction and gives the least information about the branch lengths [10] . In such cases, if two external branches are separated by short internal branches, it leads to the incorrect tree. Some of the salient features of character-based methods are mentioned in Table 1 . 

Distance-based methods use the dissimilarity (the distance) between the two sequences to construct trees. They are much less computationally intensive than the character based methods are mostly accurate as they take mutations into count. For tree generation, generally, hierarchical clustering is used in which dendrograms (clusters) are created. Table 1 briefly compares various phylogenetic tree construction methods.

Multiple alignments of related sequences may often yield the most helpful information on its phylogeny. However, it can produce incorrect results when applied to more divergent sequence rearrangements [3] . Some computationally intensive multiple alignment methods align sequences strictly based on the order in which they receive them. Multiple sequence alignment methods emphasize that more closely related sequences should be aligned first. In cases of sequences being less related to one another, however, sharing a common ancestor may be clustered separately [11] . This implies that they can be more accurately aligned, but may result in incorrect phylogeny. Alignment can provide an optimized tree if a recursive approach is followed; however, this will increase the complexity of the problem. If the differences among the lengths of sequences are very high, the alignment performance significantly impacts tree generation. The use of dynamic programming in alignment makes computation more complicated, and iterative steps limit their utility for large datasets. Therefore, consistent efforts have been made in developing and improving multiple sequence alignment methods for supporting variable length sequences with high accuracy and also for aligning a larger number of sequences simultaneously. Because of the problems associated with alignment-based phylogeny the importance of alignment-free methods is apparent [3] . Hence, the alignment quality affects the relationship created in a phylogenetic tree based on the consideration discussed above.

Alignment-free methods proposed in recent years can be classified into various categories as shown in Fig. 1 . These include k-tuple based on the word frequencies, methods that represent the sequence without using the word frequencies, i.e. compression algorithms probabilistic methods and information theory-based method. In the k-tuple method, a genetic sequence is represented by a frequency vector of fixed length subsequence and the similarity or dissimilarity measures are found based on the frequency vector of subsequence. The probabilistic methods represent the sequences using the transition matrix of a Markov chain [12] of a pre-specified order, and comparison of two sequences is done by finding the distance between two transition matrices. Graphical representation comprising 2D or 3D or even 20D methods provides an easy way to view, sort and compare various sequences. Graphical representation further helps in recognizing major characteristics among similar biological sequences.

As discussed k-tuple method uses k-words to characterize the compositional features of a sequence numerically. A biological sequence is numerically converted into a vector or a matrix composed of the word frequency. The k-word frequency pro-vides a fast arithmetic speed and can be applied to full sequences. The problem with k-tuple is a big value of k that poses a challenge in the computing time and space, and k-word methods underestimate or even ignore the importance of its location. The string-based distance measure uses substring matches with k mismatches.

Cancer research is considered one of the most significant areas in the medical community. Mutations in genomic sequences are responsible for cancer development and increased aggressiveness in patients [13, 14] . The combination of all such genes mutations, or progression pathways, across a population can be summarized in a phylogeny describing the different evolutionary pathways [9] . Application of the phylogenetic tree can be explored for finding similarities among breast cancer subtypes based on gene data [14, 15] . Discovery of genes associated in cancer subtype help researchers to map different pathways to classify cancer subtypes according to their mutations. Methods of phylogenetic tree inference have proliferated in cancer genome studies such as breast cancer [13] . Phylogenetic can capture important mutational events among different cancer types; a network approach can also capture tumour similarities.

It has been observed from the literature that in cancer disease, the driver genes change the cancer progression, and it even affects the participation of other genes thus generating gene interaction network. Phylogenetic methods can solve the problem of class prediction by using a classification tree. Phylogenetic methods give us a deeper understanding of biological heterogeneity among cancer subtype.

The research focuses on the various methods of sequence analysis to generate phylogenetic trees. The limitations associated with sequence alignment methods lead to the development of alignment-free sequence analysis. However, most of the existing alignment-free methods are unable to build an accurate tree so more refinement is required in alignment-free methods. The phylogenetic study is not limited to species evolution, but disease evolution as well. Extending phylogenetic to disease diagnosis can give birth to new treatment options and understanding its progression.

Use of phylogenetics in the molecular epidemiology and evolutionary studies of viral infections

Reconstructing optimal phylogenetic trees : a challenge in experimental algorithmics

Editorial: alignment-free methods in computational biology

Analyzing DNA strings using information theory concepts

Modified k-Tuple method for the construction of phylogenetic trees

The evolution of tumour phylogenetics: principles and practice

On maximum entropy principle for sequence comparison

A general method applicable to the search for similarity in the amino acid sequence of two proteins

Comparison of biosequences

Approximate maximum parsimony and ancestral maximum likelihood

Phylogenetic trees in bioinformatics

Weighted relative entropy for phylogenetic tree based on 2-step Markov Model

PhyloOncology: understanding cancer through phylogenetic analysis

Tumor classification using phylogenetic methods on expression data

Novel gene selection method for breast cancer classification

Sequence analysis Integrating alignment-based and alignmentfree sequence similarity measures for biological sequence classification

Hidden Markov models with applications to DNA sequence analysis

Is multiple-sequence alignment required for accurate inference of phylogeny?

Constructing phylogenetic trees using multiple sequence alignment

Phylogenetic trees in bioinformatics

Constructing phylogenetic trees using maximum likelihood

Applications and algorithms for inference of huge phylogenetic trees: a review

UPGMA clustering revisited: a weight-driven approach to transitive approximation

An improved phylogenetic tree comparison method

Neighbor-net: an agglomerative method for the construction of phylogenetic networks

Bioinformatics: a practical guide to the analysis of genes and proteins

Sequence similarity using composition method

Sequence analysis kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison