key: cord-335155-x9az3twa authors: Qi, Zhen; Hu, Yu; Li, Wei; Chen, Yanjun; Zhang, Zhihua; Sun, Shiwei; Lu, Hongchao; Zhang, Jingfen; Bu, Dongbo; Ling, Lunjiang; Chen, Runsheng title: Phylogeny of SARS-CoV as inferred from complete genome comparison date: 2003 journal: Chin Sci Bull DOI: 10.1007/bf03183930 sha: doc_id: 335155 cord_uid: x9az3twa
SARS-CoV, the pathogen of severe acute respiratory syndrome (SARS), remains a mystery: the origin of the virus is still unknown even though several isolates have been completely sequenced. To explore the origin of SARS-CoV, the FDOD method that we developed previously was applied to compare the complete genomes of 12 SARS-CoV isolates with those of 12 previously identified coronaviruses, and an unrooted phylogenetic tree was constructed. Our results show that all SARS-CoV isolates form one cluster, while the previously identified coronaviruses form the other. The three groups of coronaviruses separate clearly from one another in our tree, which is consistent with the results of previous papers. In contrast to those papers, the topology of the phylogenetic tree indicates that SARS-CoV is closer to group 1 within the genus Coronavirus. The topology also shows that the 12 SARS-CoV isolates may be divided into two groups according to their association with the SARS-CoV from Hotel M in Hong Kong, which may give some information about the transmission relationships of SARS.
SARS-CoV, the pathogen of severe acute respiratory syndrome (SARS), seems to be the first coronavirus that is lethal to humans. Coronavirus (family Coronaviridae, genus Coronavirus) is an enveloped, single-stranded, plus-sense RNA virus with a genome of approximately 30 kb. Whereas coronaviruses may cause severe disease in animals, human coronavirus strains were known to cause only mild disease until SARS-CoV was discovered. To date, genomes from 12 SARS-CoV isolates have been completely sequenced and released [1-4].
Preliminary analysis of the SARS-CoV genome indicated that the virus is not closely related phylogenetically to any previously identified coronavirus. The genome sequence gave few obvious clues to an important question: what is the origin of SARS-CoV? Based on alignments of amino acid or nucleotide sequences, some hypotheses were put forward to elucidate the origin of SARS-CoV. However, the distant relationship of SARS-CoV to any known virus, reflected in very low alignment scores, leaves these hypotheses of little value. Coronaviruses are classified into three groups according to serotype: groups 1 and 2 contain mammalian viruses, while group 3 contains only avian viruses [5, 6]. Based on a phylogenetic analysis of the predicted proteins of SARS-CoV, Rota et al. [2] claimed that SARS-CoV does not closely resemble any of these three groups and suggested a fourth group for SARS-CoV. Other authors arrived at the same conclusion by analyzing proteins of other isolates [1, 3, 4]. To resolve the uncertain transmission relationships among different SARS-CoV strains, Ruan et al. [3] compared genomes from 14 SARS-CoV isolates and identified 129 sequence variations among them. Combined with knowledge of contact history and geography, common variant sequences were used as genetic signatures to reconstruct a probable lineage map of the SARS-CoV infections. They concluded that the cases associated with infections originating in Hotel M in Hong Kong form one group, while the other isolates form another. However, some details are still unclear. Moreover, owing to limitations of the data, Ruan et al. had to restrict their analysis to 26140 loci. In addition, some of these 129 mutations might have arisen during in vitro expansion, or might be sequencing errors rather than true mutations [7].
To address these issues, we used a theoretical method (named FDOD) that we developed previously, based on Shannon's definition of information, entropy and degree of disagreement. FDOD calculates a species-specific complete information set (CIS) from the primary sequence of the whole genome, thus circumventing alignment and avoiding any bias that may be associated with particular genomic regions. The primary sequence of a genome is the result of its evolutionary history: the more closely two species are related phylogenetically, the more similar their sequences should be. Hence, the discrepancy between CISs can be regarded as a reasonable measure of species distance [8, 9]. The software we developed and the supplementary material are available upon request.
To date, genomes from 12 SARS-CoV isolates and 12 previously identified coronaviruses have been completely sequenced. We downloaded these genomes from an anonymous FTP server (ftp://130.14.22.5/genbank/genomes). Table 1 gives related information such as accession number, host, source and group. The primary sequences of the coronavirus genomes were submitted to the FDOD software to calculate a distance matrix based on the discrepancy of their CISs [8]. Then NEIGHBOR, the implementation of the neighbor-joining algorithm in the PHYLIP 3.6 package, was used to construct an unrooted tree from the distance matrix. To generate multiple data sets for evaluating the robustness of the branches of the tree, we used the jackknife algorithm for random resampling [10]. Finally, a consensus tree was produced using CONSENSE in PHYLIP 3.6.
An unrooted phylogenetic tree was constructed for the genomes of the 12 SARS-CoV isolates and the 12 previously identified coronaviruses (Fig. 1). It can be split into two parts at the point indicated by the arrow in Fig. 1: all SARS-CoV isolates are located on one side, while the 12 other coronaviruses are on the other. Consistent with the result of Rota et al. [2], the three groups of coronaviruses separate clearly from one another in our tree.
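The alignment-free distance step described in the methods can be illustrated with a minimal sketch. It is not the FDOD software: it assumes that each genome is summarized by its k-mer frequency distributions for k = 1..max_k, and that two genomes are compared with a Jensen-Shannon-style discrepancy (each distribution measured against the average of the two). The function names, the choice of max_k and the toy sequences are all hypothetical. The output is a square distance matrix in PHYLIP format, which is the kind of input that NEIGHBOR reads.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_distribution(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def discrepancy(p, q):
    """Jensen-Shannon-style discrepancy between two distributions, in bits.

    Assumed stand-in for the information discrepancy used by FDOD.
    It is 0 for identical distributions and 2 for disjoint ones.
    """
    total = 0.0
    for key in set(p) | set(q):
        mean = (p.get(key, 0.0) + q.get(key, 0.0)) / 2
        for dist in (p, q):
            x = dist.get(key, 0.0)
            if x > 0:
                total += x * math.log2(x / mean)
    return total


def genome_distance(seq_a, seq_b, max_k=3):
    """Sum the discrepancy over k-mer lengths 1..max_k (max_k is an assumption)."""
    return sum(
        discrepancy(kmer_distribution(seq_a, k), kmer_distribution(seq_b, k))
        for k in range(1, max_k + 1)
    )


def distance_matrix(genomes, max_k=3):
    """Pairwise distances for a dict mapping name -> sequence."""
    names = list(genomes)
    dist = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = genome_distance(genomes[a], genomes[b], max_k)
        dist[a][b] = dist[b][a] = d
    return names, dist


def to_phylip(names, dist):
    """Square distance matrix in PHYLIP format (names padded to 10 characters)."""
    lines = [f"{len(names):5d}"]
    for a in names:
        label = a[:10].ljust(10)
        row = " ".join(f"{dist[a][b]:.6f}" for b in names)
        lines.append(f"{label} {row}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequences for illustration only; real input would be whole genomes.
    toy = {
        "isolate_A": "ACGTACGTACGGTACC",
        "isolate_B": "ACGTACGTACGGTACG",
        "other_C": "TTGACCATTGTTAACC",
    }
    names, dist = distance_matrix(toy)
    print(to_phylip(names, dist), end="")
```

Summing over several k-mer lengths is one simple way to combine information from short and longer subsequences. The published method may weight or select them differently, so the numbers produced here should not be compared with the distances in the paper.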
The bootstrap value at the divergence point of these three groups is 92% (bootstrap values higher than 70% correspond to a probability higher than 95% [11]). Our result indicates that SARS-CoV is closer to group 1 of the coronaviruses than to the other two groups (Fig. 2), with high bootstrap support (97% for the clade of group 1 and 81% at the divergence point of cPig_l and SARS_TOR2). In contrast, Rota et al. regarded SARS-CoV as a distinct group within the genus Coronavirus based on alignment of amino acid sequences [1-4]. Previous work shows that poorly conserved or variable-length regions are not reliable for alignment-based phylogeny construction [12, 13]. Since the similarity between SARS-CoV and other coronaviruses is very low [2], a new method is needed to build the phylogenetic tree. Our results are based on a new method that circumvents sequence alignment, and they are moderately supported by serological evidence. Ksiazek et al. [14] performed immunohistochemical assays with various antibodies reactive with coronaviruses from the three groups and with an immune serum specimen from a SARS patient. Their results demonstrated strong cytoplasmic and membranous staining of infected cells with antibodies related to group 1 coronaviruses, while no staining was observed with any antibodies related to coronaviruses of groups 2 and 3. The hemagglutinin esterase gene, which is present in all coronaviruses of group 2 and some of group 3, is absent from SARS-CoV and from coronaviruses of group 1 [2], which also supports our result that SARS-CoV is closer to coronaviruses of group 1. From these results one may infer that the origin of SARS-CoV is more closely related to the coronaviruses of group 1 than to those of the other two groups. Coronaviruses of group 1 cover a wide range of hosts; however, we cannot determine which one may be the natural host carrying SARS-CoV.
Possibly, SARS-CoV came from an unknown animal that is not a host of previously identified coronaviruses. Since taxon sampling is an important factor influencing the branching pattern of a tree [15], we also constructed unrooted trees from different samplings to test the robustness of our results and of the underlying method. Very similar results were obtained, corroborating the robustness of the method (data not shown). The result gives a transmission relationship map for the 12 SARS-CoV isolates that is very similar to that drawn by Ruan et al. [3] (Fig. 3). The 12 SARS-CoV isolates form two main groups (designated "Local spread" and "Global spread"), determined by their association with exposure at Hotel M in Hong Kong. SARS_SG_3, although a secondary contact case, appears to be an ancestral strain among the 5 isolates that came from Singapore. This is nevertheless consistent with the results of Ruan et al., who attributed it to a potential back-mutation event during transmission of the virus. SARS_TW1 is placed within the "Global spread" cluster, which includes the isolates that are directly or indirectly associated with the infection at Hotel M, such as SARS_TOR2, SARS_HK_2 and SARS_Urban. We cannot decide which of the two main groups SARS_HK_1 belongs to, since its contact history is not available.
[1] The genome sequence of the SARS-associated coronavirus
[2] Characterization of a novel coronavirus associated with severe acute respiratory syndrome
[3] Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
[4] A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01)
[5] Coronaviridae, in Virus Taxonomy, Classification and Nomenclature of Viruses
[6] Coronaviridae: The viruses and their replication
[7] Effects of passage history and sampling bias on phylogenetic reconstruction of human influenza A evolution
[8] The characterization of a measure of information discrepancy
[9] Phylogeny based on whole genome as inferred from complete information set analysis
[10] Jackknife, bootstrap and other resampling plans in regression analysis
[11] Bull, J. J., An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis
[12] Alignment-ambiguous nucleotide sites and the exclusion of systematic data
[13] Elision: a method for accommodating multiple molecular sequence alignments with alignment-ambiguous sites
[14] A novel coronavirus associated with severe acute respiratory syndrome
[15] Taxon sampling and the accuracy of large phylogenies