key: cord-253331-z443e8lk
authors: Stanhope, Michael J.; Brown, James R.; Amrine-Madsen, Heather
title: Evidence from the evolutionary analysis of nucleotide sequences for a recombinant history of SARS-CoV
date: 2004-03-31
journal: Infection, Genetics and Evolution
DOI: 10.1016/j.meegid.2003.10.001
sha:
doc_id: 253331
cord_uid: z443e8lk

Abstract

The origins and evolutionary history of the Severe Acute Respiratory Syndrome (SARS) coronavirus (SARS-CoV) remain an issue of uncertainty and debate. Based on evolutionary analyses of coronavirus DNA sequences, encompassing an approximately 13 kb stretch of the SARS-TOR2 genome, we provide evidence that SARS-CoV has a recombinant history with lineages of types I and III coronavirus. We identified a minimum of five recombinant regions ranging from 83 to 863 bp in length and including the polymerase, nsp9, nsp10, and nsp14. Our results are consistent with a hypothesis of viral host jumping events, concomitant with the reassortment of bird and mammalian coronaviruses, a scenario analogous to earlier outbreaks of influenza.

Recently, healthcare institutions around the world, particularly in Asia and Canada, have been forcibly challenged to respond to sudden outbreaks of Severe Acute Respiratory Syndrome (SARS). SARS is a highly communicable, and often lethal, illness thought to be caused by a novel type of coronavirus (Ksiazek et al., 2003; Kuiken et al., 2003), a group of positive-sense, single-stranded RNA viruses known to infect domestic birds and mammals, including humans. The origin of the SARS coronavirus (SARS-CoV) has been the subject of much speculation. One of the leading hypotheses is that SARS-CoV is a hybrid strain (Enserink, 2003), since there are reports of recombination in avian coronaviruses (Lee and Jackwood, 2000). However, until a recent report in this journal (Rest and Mindell, 2003), there was no evidence that SARS-CoV is a recombinant.
Our analysis of this question, completed at the time of publication of the Rest and Mindell paper, differs from their work in the choice of methods, the extent of the genome analyzed, taxon sampling, and in the analysis of nucleotides rather than amino acids. Our results both corroborate and extend their findings, adding further support to the idea that SARS-CoV has had a recombinant history involving different coronavirus lineages, and suggesting the possibility that the genome could have arisen through a combination of host jumping and recombination events in a manner analogous to previous outbreaks of influenza (Gregory et al., 2003; Zhou et al., 1999).

Many of the molecular evolutionary methods for detection of recombination events involve the analysis of multiple DNA sequence alignments. In choosing coronavirus sequences for our analyses, we made an effort to maximize both the genetic diversity of the different coronavirus variants and the length of possible contiguous comparative data (i.e. in excess of 20 kb). We aligned (ClustalW; Thompson et al., 1994) a large portion of the SARS virus TOR2 strain, at the DNA sequence level, between positions 7349-20969, to other coronaviruses from previously designated groups I, II, and III (Marra et al., 2003; Rota et al., 2003). At the time of manuscript submission, there were 36 complete, or nearly complete, genomes of SARS virus available, all of which were highly similar at the DNA sequence level; thus strain selection does not affect the results of our analyses. The DNA sequence alignments within this region had a few segments which could not be reliably aligned, and these were excluded from our analyses. This resulted in 13 separate DNA alignments, which ranged in length from 245 to 3785 bp. Within each of these sub-alignments, any further ambiguous regions were deleted before recombination detection analyses.
This was performed in a highly conservative manner: not only did we remove any and all remotely ambiguous gaps, but the regions surrounding the gaps were additionally excluded up to areas of clearly anchored sequence alignment (identical or virtually identical stretches of sequence) flanking either side of the gap (alignments available upon request).

We used the recombination detection program PLATO (Grassly and Holmes, 1997), which employs a maximum likelihood (ML) approach to demarcate the boundaries of anomalously evolving regions in a DNA sequence alignment, with statistical measures of confidence. PLATO has a phylogenetic basis, and such methods have been shown to be somewhat less powerful than substitution distribution methods, in the sense that they are less able to identify more subtle examples of recombination (Posada et al., 2002; Posada and Crandall, 2001). However, this in turn means that such approaches are also more conservative in their overall assessment, and indeed phylogenetic methods can only detect recombination events that change the topology (Posada et al., 2002; Posada and Crandall, 2001). Importantly, the propensity for most recombination detection programs, including PLATO, to report false positives appears to be low (Posada et al., 2002; Posada and Crandall, 2001). PLATO was used to assess possible recombinant regions for each of the 13 alignments, employing parameters of an HKY model of sequence evolution, five steps for the sliding window, and 1000 replications of Monte Carlo simulation. To add a further level of conservative assessment to our recombination detection, phylogenetic analyses were performed on all partitions identified by PLATO, the putative non-recombinant portions of such alignments, as well as all the remaining alignments.
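The conservative gap trimming described at the start of this passage can be illustrated with a minimal sketch. It is not the authors' procedure, which appears to have been applied by inspection; the function name `conservative_trim`, the `anchor_len` parameter, and the strict definition of an anchor (a run of fully identical, gap-free columns) are assumptions made here for illustration. The text also allows "virtually identical" anchors, which this sketch does not model.

```python
def conservative_trim(seqs, anchor_len=3):
    """Drop gap columns plus flanking columns up to the nearest anchor.

    seqs: list of equal-length aligned strings, with '-' marking gaps.
    anchor_len: hypothetical minimum run of identical, gap-free columns
                that counts as "clearly anchored" alignment.
    """
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    columns = list(zip(*seqs))
    has_gap = ["-" in col for col in columns]
    conserved = [len(set(col)) == 1 and "-" not in col for col in columns]

    # Mark columns that belong to a conserved run of at least anchor_len.
    anchored = [False] * n
    run_start = None
    for i in range(n + 1):
        if i < n and conserved[i]:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= anchor_len:
                for j in range(run_start, i):
                    anchored[j] = True
            run_start = None

    # From every gap column, exclude outward until an anchor is reached.
    exclude = [False] * n
    for i in range(n):
        if not has_gap[i]:
            continue
        j = i
        while j >= 0 and not anchored[j]:
            exclude[j] = True
            j -= 1
        j = i + 1
        while j < n and not anchored[j]:
            exclude[j] = True
            j += 1

    keep = [i for i in range(n) if not exclude[i]]
    return ["".join(s[i] for i in keep) for s in seqs]


# Toy example: the gap at column 5 removes columns 3-7, leaving the anchors.
print(conservative_trim(["AAACT-GTCCC", "AAAGTAGACCC"], anchor_len=3))
```

Variable columns that are not adjacent to a gap are left untouched, so the phylogenetic signal outside ambiguous regions is preserved.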
For all of these phylogenetic analyses, the best fitting model of sequence evolution and the corresponding values for the rate matrix, shape of the gamma distribution, and proportion of invariant sites were estimated by the program MODELTEST (Posada and Crandall, 1998). The evolutionary history of each region was compared to the control phylogeny, which was based on a concatenation of the 13 alignments. This control topology was the same as that derived from the concatenated non-recombinant sequence portions. A region was judged to be a SARS-CoV recombinant when all, or at least the majority (for shorter sequences), of phylogenetic methods agreed in their convincing placement of SARS-CoV in an alternative position to that of the control phylogeny. Phylogenies were reconstructed using Bayesian (Huelsenbeck and Ronquist, 2001), maximum likelihood, neighbor joining (NJ, log det distances) and maximum parsimony methods, implemented in PAUP* 4.0b (Swofford, 2002). For ML, starting trees were obtained via neighbor joining, and for parsimony analyses addition sequence was employed with 10 random input orders. Tree-bisection reconnection (TBR) was the branch-swapping algorithm used in all analyses. Gaps were coded as missing data in all analyses. Bootstrap support values were obtained with 1000 replicates for maximum parsimony and neighbor joining analyses and 100 replicates for ML. Bayesian analyses were performed using MrBayes (Huelsenbeck and Ronquist, 2001) with 500,000 generations, sampling frequency every 100 generations, four Markov chains, random starting trees, and a burn-in of 100,000 generations. The PLATO results were corroborated using split decomposition analysis (program SplitsTree; Huson, 1998) and bootscanning (Salminen et al., 1995) (program BOOTSCAN within the SimPlot package). Instances identified by PLATO as possible SARS-CoV recombinants were similarly identified by SplitsTree and bootscanning.
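The decision rule stated above (a region counts as recombinant when all, or at least the majority, of the phylogenetic methods place SARS-CoV away from its control position) can be sketched as a simple vote. The function name `classify_region`, the placement labels, and the return strings are hypothetical. The sketch also assumes each method's placement has already been judged "convincing", since the text does not give explicit support thresholds for that judgment.

```python
from collections import Counter


def classify_region(placements, control="group II"):
    """Apply a majority rule across tree-building methods.

    placements: dict mapping a method name to the lineage on which that
                method places SARS-CoV (labels are illustrative).
    control: placement of SARS-CoV in the control phylogeny.
    """
    counts = Counter(placements.values())
    placement, votes = counts.most_common(1)[0]
    # Require a strict majority of methods to agree on one placement.
    if votes <= len(placements) / 2:
        return "inconclusive"
    if placement == control:
        return "control"
    return f"recombinant ({placement})"


# Hypothetical input: two methods against two gives no majority.
split_vote = {
    "Bayesian": "group I",
    "ML": "group I",
    "NJ": "group II",
    "parsimony": "group II",
}
print(classify_region(split_vote))
```

A two-against-two split is reported as inconclusive rather than forced into either category, which keeps the rule conservative.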
In the unrooted control phylogeny, SARS-CoV branches, with convincing support, along the lineage leading to group II coronaviruses (Fig. 1a), which is in agreement with previous reports (Marra et al., 2003; Rota et al., 2003). The long branch separating SARS-TOR2 from the group II coronaviruses, in comparison to the branch lengths separating the various group II representatives, is in general agreement with earlier opinions for SARS-CoV as a new, fourth group of coronaviruses (Marra et al., 2003; Rota et al., 2003), and contrary to Snijder et al. (2003), who suggest, based on analysis of replicase ORF1b, that SARS-CoV is more aptly considered a distant member of group II. For the individual alignments the models of sequence evolution identified by MODELTEST were GTR + gamma (alignments corresponding with TOR2 coordinates: 10,645-10,902; 12,613-13,344; 13,725-14,147; 20,100-20,984; and recombinant regions: 15,259-15,342; 19,577-19,862), 125; 13, 610), GTR + gamma + invariants (7366-7710; 10,147-10,626; 11,554-11,973; 11,989-12,516; 18,117-18,980; 14,172-17,936; 19,065-19,871), or HKY + gamma (recombinant region: 15,974-16,108). Under our recombination criteria, several regions of recombination were evident, involving two alternative positions of SARS-CoV (Fig. 1b and c). These two branching arrangements were SARS-CoV on the branch leading to group III viruses (avian) or as sister lineage to the group I clade (porcine, human, etc.). The anomalous regions identified by PLATO were 15,259-15,342 (Z value of 5.0666; Z values greater than 3.8896 judged to be significant), 15,974-16,108 (Z value of 4.3997; Z values greater than 3.8896 judged to be significant), and 19,577-19,862 (Z value of 6.1619; Z values greater than 3.6471 judged to be significant). Phylogenetic analysis of 15,259-15,342 supported SARS-CoV with group III (Fig. 1b), whereas 15,974-16,108 supported SARS-CoV with group I (Fig. 1c).
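The PLATO results quoted above can be tabulated and checked against their significance thresholds in a few lines. The coordinates, Z values, and critical values are taken from the text; the function name `summarize` and the dictionary layout are assumptions for illustration. Region length is computed here as the difference of the two coordinates, a convention chosen because it reproduces the 83 bp figure given in the abstract for the shortest region.

```python
# (start, end, Z value, critical Z) as reported for the three PLATO regions.
PLATO_REGIONS = [
    (15259, 15342, 5.0666, 3.8896),
    (15974, 16108, 4.3997, 3.8896),
    (19577, 19862, 6.1619, 3.6471),
]


def summarize(regions):
    """Return length and significance for each reported region."""
    rows = []
    for start, end, z, critical in regions:
        rows.append(
            {
                "region": f"{start}-{end}",
                "length_bp": end - start,  # convention assumed, see lead-in
                "significant": z > critical,
            }
        )
    return rows


for row in summarize(PLATO_REGIONS):
    print(row)
```

All three regions exceed their respective critical values, consistent with the text describing each as a significant anomaly.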
Phylogenetic analysis of the third putative recombinant region identified by PLATO (i.e. 19,577-19,862; Fig. 1d) proved inconclusive, with ML and Bayes supporting SARS-CoV with group I, and parsimony and NJ yielding the control topology (bootstrap support under 60%, and Bayesian posterior probability less than 0.50). Three further recombinant regions were identified by phylogenetic analysis that did not yield significant PLATO results, simply because the entire (or very nearly the entire) alignment appears to represent a recombinant zone (i.e. nothing for PLATO to identify as anomalous; Fig. 1d). Mutational saturation at synonymous positions of codons can be ruled out as a possible explanation for the alternative branching arrangements of these five (possibly six) recombinant zones, because phylogenies for these same regions based on alignments that exclude third codon positions, as well as amino acid sequences, yielded identical topologies. The resulting genomic picture suggests a complex evolutionary history of recombination involving SARS-CoV (Fig. 1d). The placement of SARS-CoV on the branches leading to groups I or III, and not nested within these groups, indicates that either the recombination events are ancient in nature or the donor species are not present in currently available sequence data. The inclusion of greater host species representation, which is presently possible for a few regions of the genome, such as a 922 bp region of polymerase (for which there are additional GenBank sequences from cat, dog-group I; turkey-group III; human OC43, porcine-group II) (Stephensen et al., 1999), did not allow a more specific identification of the possible species involved, and implicated the same recombination event between positions 15,259-15,342 (Fig. 1d).
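The saturation control mentioned above relies on alignments with third codon positions removed. A minimal sketch of that filtering step follows. It assumes the sequence is in frame, so that index 0 (after an optional offset) is the first position of a codon; the function name `drop_third_positions` and the `frame_offset` parameter are hypothetical, and the authors' own tooling for this step is not described in the text.

```python
def drop_third_positions(seq, frame_offset=0):
    """Keep only first and second codon positions of an in-frame sequence.

    seq: aligned coding sequence as a string.
    frame_offset: hypothetical number of leading bases to skip so that
                  the remainder starts at codon position 1.
    """
    in_frame = seq[frame_offset:]
    return "".join(base for i, base in enumerate(in_frame) if i % 3 != 2)


# Codons ATG GCC TAA lose their third bases.
print(drop_third_positions("ATGGCCTAA"))
```

Applying the same function to every row of an alignment keeps the columns in register, provided all rows share one reading frame and gaps were inserted in whole-codon units.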
Two recent reports regarding the SARS genome suggest, based on analysis of amino acid sequences, that there is either no evidence for recombination or no evidence for recent recombination involving other coronaviruses (Marra et al., 2003). Although the methodological details regarding recombination detection are scant in both these reports, we gather that in the one case they came to this conclusion by comparing branching arrangements between gene trees (Marra et al., 2003), and in the other case by performing an amino acid similarity plot. In the first case, a comparison of gene trees would not pick up recombination events that crossed gene boundaries, or which involved relatively short stretches of sequence within a gene. In the second instance, similarity plots will only tend to pick up recombination events in comparisons that involve the actual donor, a close relative of the donor, and/or a recent event. In contrast, our analysis agrees with Rest and Mindell (2003) in identifying recombination in RDRP (RNA dependent RNA polymerase), although our approach tends to suggest more specific break-points, and a larger number of smaller recombinant regions, than does their analysis (three regions in RDRP: 13, 610; 15,259-15,342; 15,974-16,108, based on TOR2 coordinates). We also identified several additional recombinant regions in the SARS-CoV genome, encompassing regions not analyzed by Rest and Mindell, including 12,613-13,344, which includes all of nsp9 and most of nsp10, and 18,117-18,980 of nsp14. Analysis of currently available sequences of coronaviruses yields the conclusion that group III is exclusively composed of avian coronaviruses, while groups I and II have viruses isolated from pig, human, murine rodents, cat, dog and bovine. Our results indicate that SARS-CoV recombined with a member of the group III lineage, suggesting that an avian coronavirus was involved, a further point of general agreement between our results and those of Rest and Mindell (2003).
Other recombination events evident from our analysis involve the branch leading to group I, which encompasses viruses from several mammalian taxa, including two very divergent strains of porcine coronaviruses. Thus, our analyses indicate that human SARS-CoV has a past history of recombination with coronaviruses hosted in distinct animal groups. Mixed animal husbandry practices, in proximity to human populations, could have led to the evolution of the SARS coronavirus and facilitated its progression as an infectious disease in humans. Novel human influenza viruses are thought to have arisen from the reassortment, within porcine hosts, of avian, swine, and human influenza viruses (Gregory et al., 2003; Zhou et al., 1999). We suggest that our recombination results for SARS-CoV implicate a suspiciously analogous history. More specifically, SARS-CoV could have arisen from a combination of host jumping and recombinational events, involving as yet unidentified strains of avian coronavirus group III and mammalian (possibly pig) coronavirus group I. Rest and Mindell (2003) suggested that host-species shifts have been relatively common in the diversification of coronavirus lineages, a result consistent with our hypothesis for SARS-CoV. Critical to determining the evolutionary origin of SARS-CoV are expanded epidemiological surveys of wild and domestic animals, including, in particular, additional avian species. Understanding the origin and evolutionary history of SARS-CoV is important to proper vaccine development as well as the epidemiological modeling of future outbreaks. Current perception of the SARS-CoV genome is one of relative genetic stability (Brown and Tetro, 2003; Ruan et al., 2003). However, our analyses indicate that SARS-CoV has a complex history of recombination, suggesting that the genome may not be as stable as previously thought. We propose that future epidemiological modeling efforts and vaccine development take this new evidence into account.
Comparative analysis of the SARS coronavirus genome: a good start to a long journey
Infectious diseases. Calling all coronavirologists
Aetiology: Koch's postulates fulfilled for SARS virus
A likelihood method for the detection of selection and recombination using nucleotide sequences
Human infection by a swine influenza A (H1N1) virus in Switzerland
MRBAYES: Bayesian inference of phylogenetic trees
SplitsTree: analyzing and visualizing evolutionary data
A novel coronavirus associated with severe acute respiratory syndrome
Evidence of genetic diversity generated by recombination among avian coronavirus IBV
Evaluation of methods for detecting recombination from DNA sequences: empirical data
MODELTEST: testing the model of DNA substitution
Evaluation of methods for detecting recombination from DNA sequences: computer simulations
SARS associated coronavirus has a recombinant polymerase and coronaviruses have a history of host-shifting
Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning
Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage
Phylogenetic analysis of a highly conserved region of the polymerase gene from 11 coronaviruses and development of a consensus polymerase chain reaction assay
PAUP* Version 4.0b10. Sinauer Associates
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Genetic reassortment of avian, swine, and human influenza A viruses in American pigs