key: cord-103204-gt6upfri authors: Hölzer, Martin; Marz, Manja title: PoSeiDon: a Nextflow pipeline for the detection of evolutionary recombination events and positive selection date: 2020-05-20 journal: bioRxiv DOI: 10.1101/2020.05.18.102731 sha: doc_id: 103204 cord_uid: gt6upfri Summary PoSeiDon is an easy-to-use pipeline that helps researchers to find recombination events and sites under positive selection in protein-coding sequences. By entering homologous sequences, PoSeiDon builds an alignment, estimates a best-fitting substitution model, and performs a recombination analysis followed by the construction of all corresponding phylogenies. Finally, significantly positive selected sites are detected according to different models for the full alignment and possible recombination fragments. The results of PoSeiDon are summarized in a user-friendly HTML page providing all intermediate results and the graphical representation of recombination events and positively selected sites. Availability and implementation PoSeiDon is freely available at https://github.com/hoelzer/poseidon. The pipeline is implemented in Nextflow with Docker support and processes the output of various tools. Contact hoelzer.martin@gmail.com Selection pressure continuously influences the evolution of genes and can be studied in many ways (Vitti et al., 2013) . For example, positive or diversifying selection can be detected by comparing the rates of non-synonymous (dN ) and synonymous substitutions (dS) in an alignment of orthologous genes. Over several sites (codons), the dN/dS ratio (or ω) can reach values well above 1 (Yang, 2007) , and such sites are likely to be positively selected. For instance, specific amino acid changes are favored if they increase the host's fitness against a pathogen (Fumagalli et al., 2011) . Alternatively, the genes of a pathogen are affected, as in the COVID 19 pandemic, where positively selected sites in the spike protein of the virus gave cause for concern (Korber et al., 2020) . The detection of positive selection enables researchers to gain insights into the evolution of genes and thus develop countermeasures against pathogens that are in a constant 'arms-race' with their host. Since recombination can have a profound influence on evolutionary processes and can adversely affect phylogenetic reconstruction and the accurate detection of positive selection (Shriner et al., 2003) , screening for breakpoints to define recombinant parts within an alignment should be a standard step in any comparative evolutionary study. A comprehensive evolutionary analysis of significantly positively selected sites consist of several complicated steps, including (1) in-frame alignment; (2) Indel correction; (3) phylogenetic tree calculation; (4) selection of a best-fitting nucleotide substitution * To whom correspondence should be addressed. model; (5) detection of topological incongruence and breakpoint selection to describe putative recombination events; (6) calculation of positively selected sites (ω > 1) under varying models; (7) and their impact on the selective pressure acting on the whole alignment. Thus, such an analysis involves dozens of different tools and parameter settings. In addition, the results of many well-established and widely used tools in this field of evolutionary science are not easy to interpret and process. Especially, the accurate detection and handling of putative recombination events is a challenging but essential task. Currently, only a few tools for the comprehensive detection of positive selection exist. These tools either do not automatically combine all of the described steps (Delport et al., 2010) , do not take possible recombination events into account (Doron-Faigenboim et al., 2005; Stern et al., 2007; Webb et al., 2017) , or focus only on the detection of positive selection in prokaryotic genomes (Su et al., 2013) . Here we present PoSeiDon, a pipeline that allows researchers to perform comprehensive evolutionary studies by automatically taking care of all tasks mentioned above. PoSeiDon does detect not only positively selected sites in an alignment of homologous sequences but also possible recombination events that could otherwise adversely affect the positive selection detection. The input is a single FASTA file consisting of protein-coding DNA sequences with a correct open reading frame. The output is summarized in T E X and PDF format and visualized in a user-friendly HTML page, providing access to all results and intermediate files. PoSeiDon comprises an assembly of different scripts and tools ( Fig. 1) that allow for the detection of recombination and positive selection in protein-coding sequences. Each third-party tool encapsulates in a Docker container and all steps are connected in a Nextflow (Di Tommaso et al., 2017) implementation for full parallelization and simple execution. If Nextflow and Docker are configured, PoSeiDon can be installed and run with a single command: nextflow run hoelzer/poseidon --help. Different profiles allow the reliable execution of PoSeiDon in the cloud or on a cluster system. Starting from homologous coding sequences provided by the user, we build a multiple sequence alignment guided by amino acid information with TranslatorX (v1.1) (Abascal et al., 2010) , using Muscle (v3.8.31) (Edgar, 2004) to align the sequences. The resulting in-frame nucleotide alignment is cleaned for Indels. A best-fitting substitution model is selected c Oxford University Press 2016. using MODELTEST (Posada and Crandall, 1998) , which is part of the HyPhy suite (v2.2.7) (Pond et al., 2005) . Possible recombination events and corresponding breakpoints in the alignment are detected using GARD (Pond et al., 2006b,a) under the previously selected substitution model. All breakpoints are tested for significant topological incongruence using a Kashino Hasegawa (KH) test (Kishino and Hasegawa, 1989) . KHinsignificant breakpoints most frequently arise from variation in branch lengths between segments. However, we observed interesting positively selected sites in fragments without any significant topological incongruence (Fuchs et al., 2017) . Thus, KH-insignificant breakpoints can be taken into account and are marked in the final output, as they might not occur from real recombination events. Positions of putative breakpoints that would destroy the open reading frame are adjusted to the next valid position. Phylogenetic reconstructions on the full alignment and all fragments are performed with RAxML (v8.2.12) (Stamatakis, 2014 ) using the GTRGAMMA model for nucleotide sequences and PROTGAMMAWAG for amino acids. All calculations are performed with 1 000 bootstrap replicates. The user can apply optional outgroup rooting. The Newick Utilities suite (v1.6) (Junier and Zdobnov, 2010 ) is used to visualize the calculated trees in different formats. Positive selection is analyzed on the full alignment and each of the fragments separately. Maximum-likelihood tests to detect positive selection under varying site models are performed with CODEML (M1a vs. M2a, M7 vs. M8) implemented within the PAML suite (v4.9) (Yang, 2007) . Furthermore, we implemented the M8a vs. M8 test proposed by Swanson et al. (2003) as an additional model test in PoSeiDon. The different statistical site models that do not allow (neutral models) or allow (selection models) a class of codons to evolve with ω > 1 are compared. Furthermore, varying codon frequency models are applied to simulate different nucleotide substitution rates. A gene is declared to be positively selected if the neutral model can be rejected in favor of the positive selection model based on a likelihood ratio test. Then, a Bayes empirical Bayes (BEB) approach (Yang et al., 2005) is applied to calculate posterior probabilities (P P ) that a codon comes from the site class with ω > 1. Positively selected sites with an assigned P P > 0.95 are depicted as significant. We graphically summarize all positively selected sites under varying frequency models in the output (Fig. 1) . Thus, we allow the user to investigate sites that would be dismissed from the output when using a P P threshold. For example, such sites could be located in regulatory domains of the final protein, yielding a lower P P value due to insufficient species sampling (McBee et al., 2015) . The final output of PoSeiDon is based on a heavily modified version of the TranslatorX HTML output. The amino acid color code is adapted from TranslatorX. Here we present PoSeiDon, an easy-to-use Nextflow pipeline for the accurate detection of site-specific positive selection and recombination events in protein-coding sequences. The input is a multiple FASTA file of homologous coding sequences that is automatically transferred into a codon-based alignment. Since recombination can have a profound impact on the evolutionary history of sequences, we initially check the alignment for topological incongruence to define putative recombination breakpoints. The whole evolutionary analysis of PoSeiDon is performed independently for the full alignment and all possible fragments. PoSeiDon automatically calculates maximum likelihood-based phylogenetic trees for all alignments, estimates ω values at each site, and computes their impact on the positive selection. All identified sites and their significance values are projected onto the codon and amino acid alignment of the input sequences to allow visual identification of evolutionary hot-spots with high ω values. Additionally, publication-ready PDF and L A T E X tables are provided, including all breakpoints and significantly positively selected sites. All results are summarized in a userfriendly and clear manner, allowing researchers to study positive selection. Funding: This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) through the Priority Program SPP-1596 MA5082/7-1 and was partly conducted within the Collaborative Research Centre AquaDiva (CRC 1076 AquaDiva) of the Friedrich Schiller University Jena, also funded by the DFG. MH appreciates the support of the Joachim Herz Foundation by the addon fellowship for interdisciplinary life science. We would like to thank Lasse Faber for his help in implementing the Dockers. Conflict of Interest: none declared. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology Nextflow enables reproducible computational workflows Selecton: a server for detecting evolutionary forces at a single amino-acid site MUSCLE: multiple sequence alignment with high accuracy and high throughput Evolution and antiviral specificities of interferon-induced Mx proteins of bats against Ebola, Influenza, and other RNA viruses Signatures of environmental genetic adaptation pinpoint pathogens as the main selective pressure through human evolution The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea The Effect of Species Representation on the Detection of Positive Selection in Primate Gene Data Sets HyPhy: hypothesis testing using phylogenies Automated phylogenetic detection of recombination using a genetic algorithm GARD: a genetic algorithm for recombination detection MODELTEST: testing the model of DNA substitution Potential impact of recombination on sitewise approaches for detecting positive natural selection RAxML version 8: a tool for phylogenetic analysis and postanalysis of large phylogenies Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach PSP: rapid identification of orthologous coding genes under positive selection across multiple closely related prokaryotic genomes Pervasive adaptive evolution in mammalian fertilization proteins Detecting natural selection in genomic data VESPA: very large-scale evolutionary and selective pressure analyses PAML 4: phylogenetic analysis by maximum likelihood Bayes Empirical Bayes Inference of Amino Acid Sites Under Positive Selection