key: cord-312991-ypgrw78s
authors: Wang, Zhi-Gang; Zheng, Zhi-Hua; Shang, Lei; Li, Lan-Juan; Cong, Li-Ming; Feng, Ming-Guang; Luo, Yun; Cheng, Su-Yun; Zhang, Yan-Jun; Ru, Miao-Gui; Wang, Zan-Xin; Bao, Qi-Yu
title: Molecular evolution and multilocus sequence typing of 145 strains of SARS-CoV
date: 2005-09-12
journal: FEBS Lett
DOI: 10.1016/j.febslet.2005.07.075
sha:
doc_id: 312991
cord_uid: ypgrw78s

In this study, we identified 876 polymorphic sites in 145 complete or partial genomes of SARS-CoV available in the NCBI GenBank. One hundred and seventy-four of these sites were present in two or more SARS-CoV genome sequences. According to the sequence polymorphism, all SARS-CoVs can be divided into three groups: (I) group 1, animal-origin viruses (such as SARS-CoV SZ1, SZ3, SZ13 and SZ16); (II) group 2, all viruses of clinical origin from the first epidemic; and (III) group 3, SARS-CoV GD03T0013. According to 10 special loci, group 2 can be further divided into genotypes C and T, which can in turn be divided into sub-genotypes C1–C4 and T1–T4. Positive Darwinian selection was identified between every pair of these three groups. Genotype C shows neutral selection, whereas genotype T shows negative selection. A comparison of the death rates of SARS patients in different regions showed that the death rate caused by viruses of genotype C was lower than that caused by viruses of genotype T. SARS-CoVs might originate from an unknown ancestor.

In 2002, a new infectious disease with high mortality and morbidity was found in the southern part of China and was termed "severe acute respiratory syndrome" (SARS). It affected 8096 people and caused 774 deaths around the world within a very short span of time [1]. Several studies have shown that SARS is caused by a new member of the coronavirus family. The SARS coronavirus (SARS-CoV) was thought to have been transmitted from wild animals [2-4].
This hypothesis was then supported by the identification of a SARS-CoV-like virus in civet cats sold in markets in southern China. This isolate had more than 99% sequence identity with SARS-CoV, indicating that the virus had recently been transferred from animals to humans [5]. However, according to a recent report, there is no direct evidence that the civet cat virus is the origin of SARS-CoV [6]. Although unlikely, the possibility that SARS-CoV-positive animals were infected by humans cannot be formally excluded, and transmission of SARS-CoV from human to pig has indeed been reported [7]. How the pathogen crossed the barrier between its natural reservoir and humans is still unclear [8]. After the first SARS epidemic, there were three incidents of laboratory infection, in Singapore, Taiwan and Beijing [9-11]. Furthermore, in Guangdong, China, SARS patients with "mild clinical symptoms" were observed [12]. It is therefore important to further explore the characteristics of the SARS-CoV genome and to trace the source of the epidemic.

Several studies on SARS-CoV phylogeny and genotyping were published immediately after SARS was identified. Using 4 polymorphic sites, Ruan et al. [13] first genotyped 14 previously published SARS-CoV genomes at the beginning of May 2003. Subsequently, Li et al. [14] proposed genotypes C and T for 17 SARS-CoV genomes based on 5 polymorphic sites. Regional and international transmission types were also proposed [15, 16]. SARS-CoV Su-10 and CUHK-W1 were assigned to two different groups even though both were isolated in Hong Kong [17]. Based on 7 loci in 44 SARS-CoV genomes, the C and T genotypes were further modified and named the Yexin and Xiaohong genotypes [18]. Using 5 polymorphic sites, 63 SARS-CoV genomes were divided into early, middle and late phase genotype groups [19]. The polymorphic sites analyzed in these papers were almost the same.
Three cases have been reported in which infection chains were successfully traced by comparing variations in SARS-CoV genomes. First, the first SARS-CoV laboratory infection incident was confirmed by identifying 13 variant loci in the SARS-CoV genomes [9]. Secondly, it was found that the SARS-CoV isolated from the patient with "mild clinical symptoms" and the SARS-CoV (or SARS-CoV-like virus) isolated from animals resembled each other in the sequence encoding the spike protein [19]. This study indicates that virus-carrying animals may be a risk for the investigator. Thirdly, sequence analysis confirmed the laboratory infection of a technician who was working with SARS-CoV [20].

To date, 295 complete or partial SARS-CoV genome sequences are available in GenBank. A new method of genotyping is proposed in this study. The results show that all SARS-CoV genomes cluster into three groups and that the SARS-CoV genomes in the second group belong to genotypes C and T. These two genotypes can be further divided into 8 sub-genotypes. In this study we also analyzed the SARS death rate, the origin of SARS-CoV, and its population genetics.

SARS-CoV ZJ01 (Accession No. AY297028) was isolated in Zhejiang province, China [14]. The other SARS-CoV genome sequences were downloaded from the NCBI GenBank database. A search returned 295 complete or partial SARS-CoV genomes. After shorter and duplicate records were removed, 101 complete genomes and 44 partial genome sequences were retained for the final analysis (as of September 20, 2004). Detailed information on the 20 earliest records in GenBank is shown in Table 1, and the other sequences are listed in Table S1 (Supplementary materials). The latest information on domestic and international SARS fatality rates was downloaded from the websites of the Health Ministry of China (http://168.160.224.167/) and the WHO (http://www.who.int/csr/sars/country/Table2004_04_21/en/).
The analysis platform was a PC server with a P4 hyper-threading CPU, and the operating system was Windows XP. ClustalW version 1.83 was used for multiple sequence alignments. TreeView (Win32) version 1.6.6 and MEGA 2.0 were used to draw the phylogenetic trees. DNASP 4.0 was used to analyze single nucleotide polymorphisms.

The SARS-CoV sequences were aligned with the ClustalW program, and the alignment was then manually examined and adjusted. Only those variant sequence loci that were present in at least two independent sequences were selected for further analyses [13, 18, 19]. Deletions and insertions in the SARS-CoV genomes, neutral mutations in the spike protein gene, etc., were also estimated. The latest version (NC_004718) of the first submitted SARS-CoV genome sequence (SARS-CoV TOR2, AY274119) was used as the reference.

3.1. The distribution of single nucleotide polymorphism loci of SARS-CoV genome

Eight hundred and seventy-six mutation loci were identified among the 145 complete or partial SARS-CoV genomes (the overall mutation rate was 2.94%, 876/29 751), of which 174 loci were identified in two or more genome sequences. To avoid errors that might be introduced by sequencing and cell culture passages, only this latter group of 174 loci was analyzed. The polymorphic sites showed different characteristics. Loci involving a T mutation accounted for 60.9% (106/174), and loci involving C, A and G mutations accounted for 39.6% (69/174), 39.0% (68/174) and 27.0% (47/174), respectively. Mutations involving T were clearly the most frequent. All of the loci except 7 were limited to one pair of nucleotide variants. One hundred and twenty loci were transition mutations (78 C/T, 42 A/G), and 48 loci were transversion mutations. The ratio of transition to transversion mutations was 2.5 (120/48). The remaining 6 loci were deletion mutations.
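The transition/transversion bookkeeping above can be illustrated with a short sketch. The function names are hypothetical and are not taken from the software used in the study; only the counts (78 C/T, 42 A/G, 48 transversions, 876 loci, 29 751 positions) come from the text.

```python
# Minimal sketch: classify single-nucleotide substitutions and recompute
# the summary figures quoted in the text. Helper names are illustrative.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-nucleotide change."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(n_transitions: int, n_transversions: int) -> float:
    """Ratio of transition loci to transversion loci."""
    return n_transitions / n_transversions


# Counts reported in the text.
ratio = ts_tv_ratio(78 + 42, 48)
overall_rate_percent = 100 * 876 / 29751

print(classify_substitution("C", "T"))   # a pyrimidine-to-pyrimidine change
print(classify_substitution("G", "T"))   # a purine-to-pyrimidine change
print(ratio, round(overall_rate_percent, 2))
```

With the reported counts, the ratio evaluates to 2.5 and the overall rate to about 2.94%, matching the figures stated above.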
The sequence deletions in the SARS-CoV genome mainly occurred in the region between sars7a and the N protein gene.

Key mutation loci are shown in Table 3 (identified using the nucleotide-nucleotide BLAST program at NCBI and ClustalW 1.83). The phylogenetic tree of the 174 loci (Fig. 1) showed that, according to the polymorphic sites C9404T, C9479T, G17564T, G19838A, A21721G, C22222T, G22517A, G23823T, T27243C and C27827T, the SARS-CoV genomes of the first epidemic can be divided into two genotypes, C and T. Genomes of genotype C had, at one or more of the 10 loci listed above, the same nucleotide as the pattern C:C:G:G:A:C:G:G:T:C. These genomes included GD01, ZS-C, CUHK-W1, BJ01, BJ302-1 and GZ-C, etc. Genomes of genotype T had all 10 nucleotides of the pattern T:T:T:A:G:T:A:T:C:T. This genotype included GZ-B, ZJ01, Sin2679, Urbani, TC1, CUHK-AG01, Sino1-11, TOR2 and AS, etc. These polymorphic sites were located in 5 different regions: replicase 1A, replicase 1B, the spike protein gene, sars6 and sars8a. Except for the 4th and the 7th loci, which were synonymous mutations, the loci were non-synonymous mutations with amino acid changes of A/V, A/V, E/D, G/D, T/I, D/Y, P/L and R/C, respectively. This reveals a strong bias: most of the changes that define the genotypes are associated with changes in the encoded proteins.

Genotype C can be further divided into 4 sub-genotypes. Besides the 10 sites mentioned above (Table 3), sub-genotype C1 had the loci C3626T, C8559T, T22207C and G22522A, which distinguish it from the others. In the same way, sub-genotype C2 had G3962A, G9448T and T19882C, and sub-genotype C4 had the T9854C site. As shown in Fig. 1, sub-genotype C3 is a transitional one. Genotype T could also be divided into 4 sub-genotypes, T1, T2, T3 and T4. Besides the 10 common polymorphic sites shown above, they had 3, 4, 4 and 3 variant loci, respectively (Table 3).
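The C/T rule based on the 10 loci described above can be sketched in a few lines. This is only an illustration of the rule as stated in the text, not the authors' software: the function name, the "unclassified" label, and the assumption that the 10 alleles have already been read from an alignment at the listed positions are all hypothetical.

```python
# Minimal sketch of the genotype C / genotype T rule from the 10 loci.
# Assumption: `alleles` holds the nucleotides observed at the loci below,
# in this order, using the coordinates given in the text.

LOCI = (9404, 9479, 17564, 19838, 21721, 22222, 22517, 23823, 27243, 27827)
C_PATTERN = "CCGGACGGTC"  # C:C:G:G:A:C:G:G:T:C
T_PATTERN = "TTTAGTATCT"  # T:T:T:A:G:T:A:T:C:T


def call_genotype(alleles: str) -> str:
    """Return 'T', 'C' or 'unclassified' for a string of 10 alleles."""
    alleles = alleles.upper()
    if len(alleles) != len(LOCI):
        raise ValueError(f"expected {len(LOCI)} alleles, got {len(alleles)}")
    # Genotype T requires the full T pattern at all 10 loci.
    if alleles == T_PATTERN:
        return "T"
    # Genotype C requires a match to the C pattern at one or more loci.
    if any(a == c for a, c in zip(alleles, C_PATTERN)):
        return "C"
    # Fallback label introduced here for inputs matching neither rule.
    return "unclassified"


print(call_genotype("TTTAGTATCT"))
print(call_genotype("CCGGACGGTC"))
print(call_genotype("TTTAGTATCC"))  # last locus carries the C-pattern allele
```

Because genotype T demands the complete pattern while genotype C needs only one matching locus, a genome that differs from the T pattern at a single defining site falls into genotype C under this reading of the rule.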
ZJ01, HSR1 and Sin2679, which belong to genotype T and have more polymorphic sites in their genomes, were located in the middle of the phylogenetic tree (Fig. 1) and were not sub-genotyped in this paper (Table 3). Only some genomes in sub-types C1 and C2 had the 29 nt fragment. SARS-CoV genomes in sub-type T3 lacked a region of 415 nt (Fig. 1, Table 2).

To verify the genotyping results of the 174 loci shown above, we reconstructed the phylogenetic trees of 122 spike protein genes and 101 complete genomes (Fig. 2A and B). As can be seen in Fig. 2A, most SARS-CoVs clustered in accordance with their genotypes. Some partial genome sequences that were placed in this tree, such as CUHK-L2, SZ1, GZ43, HSZ-A, GD03T0013, BJ302-1, HKU-65806 and LY, could not be properly located in the phylogenetic tree based on the 174 loci. The genomes of GZ-C and QXC1 showed features intermediate between genotypes C and T. Some sub-genotypes, such as C1, C2, T2 and T3, were easily recognized in the spike protein gene tree, but others were intermixed. In the phylogenetic tree of complete genomes, the clusters of genotypes C and T and the 8 sub-types are obvious (Fig. 2B). One of the reasons for this might be that the spike protein genes were less informative than the complete genomes.

3.5. The evolutionary relationship among GD03T0013, the animal source viruses and the first epidemic SARS-CoVs

Fig. 3 shows that the genome of the recent SARS-CoV GD03T0013 was genetically closer to the first epidemic SARS-CoV GZ02 than to the animal-origin SARS-CoV SZ3.

among the genotypes and the groups

The genetic characteristics of the genotypes were further analyzed based on the sequence polymorphism of the spike protein genes (Table 4). The Pa/Ps value for the polymorphic sites among the genomes of genotype C was 1.047, which indicates neutrality or only slight selection.
In genotype T, the number of nonsynonymous mutation loci decreased, and the Pa/Ps value was 0.230 (P < 0.05, with significant differences between the genomes), which indicates negative selection in this genotype. The Ps% value for synonymous mutations, which is less influenced by general environmental selection, did not differ significantly between the two genotypes. The Ps% value is one of the characteristic indices of the genetic relationship between populations, and the Ps% value of group 2 (C + T) was approximately half that of group 1 (animal source viruses, Table 4). The animal source viruses (SARS-CoV SZ1, SZ3, SZ13 and SZ16) also showed negative selection among themselves.

According to the phylogenetic trees based on the different data sets described above, the genetic diversity between each pair of the 3 groups (the animal source viruses, the first epidemic viruses including the C and T genotypes, and GD03T0013) was evaluated. The Ka/Ks values for every pair of groups were all greater than 1, showing positive Darwinian selection between these groups. On the other hand, the Ka/Ks values between genotype C and ZS-B (one of the first epidemic viruses and the closest to the animal origin viruses) and between genotype T and ZS-B were all less than 1 (Figs. 1 and 2A, Table 4).

All analyses described above indicated that the animal source viruses and the GD03T0013 virus could be classified as two relatively independent populations. The 145 SARS-CoV genome sequences could be classified into three groups, and group 2 could be further divided into two genotypes and eight sub-genotypes (Fig. 4).

Based on the new concept of dividing all 145 SARS-CoV genome sequences into three groups, the origin of group 2 and group 3 (taking group 1 as the outgroup) was estimated.
Eight candidates were selected to estimate the date at which their most recent common ancestor existed: four animal origin genomes of group 1 identified in April to May 2003; three early genomes, HGZ8L1-A, ZS-B and GZ02, in sub-genotype C1 of group 2, from January to February 2003; and the genome of GD03T0013 of group 3, from December 2003. Based on analysis of the diversity of the spike protein genes, the synonymous mutation rate Ks between groups 1 and 2 was 0.00321 ± 0.00135, and the synonymous mutation rate Ks between groups 1 and 3 was 0.00509 ± 0.00199. Assuming that the synonymous mutation rate was constant, a linear regression analysis was performed. The linear equation relating Ks to the occurrence time T is Ks = 0.000171T + 0.00321 (T counted from January 2003 in one-month intervals), and the date of the most recent common ancestor was estimated from this relationship.

Information about the SARS fatality rate and genotypes in all regions is shown in Table 5. A comparison of the SARS fatality rates of different regions showed that SARS-CoV genomes of genotype T were associated with the highest case fatality rate.

Several studies on SARS-CoV genome genotyping and grouping have been published [13-19]. Some presented regional and international transmission types according to the epidemic regions, and some divided the viruses into early, middle, and late phase groups according to the time course of the SARS epidemic in a certain region (Guangdong, China). Based on the genome characteristics of 145 SARS-CoVs, a new concept of three groups, two genotypes and eight subtypes was established.
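The coefficients of the linear relation between Ks and time reported in the Results can be checked against the two Ks estimates given there. The sketch below is one reading that reproduces them, under two assumptions that the text does not spell out: the group 1 versus group 2 estimate is placed at T = 0 (January 2003) and the group 1 versus group 3 estimate at T = 11 (December 2003). Variable and function names are illustrative.

```python
# Minimal sketch: recover the reported slope and intercept from two points.
# Assumption: T = 0 for the group 2 genomes and T = 11 for GD03T0013.

ks_group1_vs_group2 = 0.00321  # reported Ks, groups 1 and 2
ks_group1_vs_group3 = 0.00509  # reported Ks, groups 1 and 3
months_between = 11            # January 2003 to December 2003

slope = (ks_group1_vs_group3 - ks_group1_vs_group2) / months_between
intercept = ks_group1_vs_group2


def ks_at(month: float) -> float:
    """Ks predicted by the line at `month` months after January 2003."""
    return slope * month + intercept


print(round(slope, 6), intercept)
print(round(ks_at(11), 5))
```

Under these assumptions the slope rounds to 0.000171 per month and the intercept is 0.00321, in agreement with the equation quoted in the text. The sketch does not attempt to reproduce the ancestor date itself, since the extrapolation step is not described in enough detail.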
Genotype C of group 2 in this work largely corresponds to the regional transmission type and to the early and middle phase groups, while genotype T of group 2 corresponds to the international transmission type and to the late phase group [16, 19]. One advantage of the new classification is that it fits both global and regional analyses of epidemic features, such as for the SARS-CoVs in Taiwan and Singapore. To date, all the genomes from clinical samples in the early, middle, and late phases of the SARS epidemic in these regions belong to genotype T. The concept of dividing all SARS-CoVs into three groups was based mainly on the phylogenetic analysis and the positive Darwinian selection. Because of the limited number of viruses in groups 1 and 3, this grouping needs further confirmation.

The source of the SARS-CoVs has been traced. It was clear that genotype T was derived from genotype C [14, 18, 19]. The origins of each sub-genotype have also been elucidated [22-24] and are consistent with our results. However, the origin of SARS-CoV remains an open question and needs further investigation. Zhao et al. [25] analyzed 16 SARS-CoV genomes and proposed that the closest common ancestor might have existed during the spring of 2002, which is consistent with the date, May 2002, estimated from the spike gene synonymous mutation rate of the three groups in this work. It is very probable that the three groups originated from an unknown common ancestor.

The genotype classification of SARS-CoV plays an important role in tracing and controlling the spread of SARS. According to the analysis of genome sequence variation, genotype shifts of SARS-CoVs in the main epidemic regions during the first epidemic were obvious. Sub-genotypes C1 and C2, which are more closely related to the virus of animal origin, appeared first in Guangdong, followed by sub-genotype C3. The dominant viruses in Hong Kong were of genotype T, but genotype C also appeared there.
Similarly, although sub-genotype C4 dominated in Beijing, genotype T also existed there. In Singapore, all the genomes were of genotype T, mostly of sub-genotypes T1 and T2. The viruses in Taiwan were all of genotype T, mostly of sub-genotype T4. As more SARS-CoV genome sequences and more complete epidemiological data are released, the characteristics of SARS-CoV's variation among the genotypes, as well as the features of the epidemic phases, the epidemic regions, the clinical symptoms, and the viruses' adaptation to the human host, will become clearer.

The genotype was apparently correlated with the virulence of SARS-CoV. SARS-CoV GD03T0013 came from the patient with "mild clinical symptoms", and its genome sequence was similar to that of sub-genotype C1. Experiments have also demonstrated that SARS-CoVs of genotype T had a stronger ability to cause cytopathic effects (CPE) than those of genotype C. The CPE of genotype T remained stable during passages in cell culture [26]. In the first epidemic, SARS patients in Guangzhou seldom suffered from diarrhea, and the SARS-CoV genomes were mainly of genotype C [27]. The majority of SARS patients in Hong Kong, however, suffered from diarrhea, and SARS-CoVs of genotype T turned out to dominate there [28]. SARS-CoVs of different genotypes might therefore differ in virulence. Moreover, serum from mice immunized with inactivated SARS-CoV BJ01, of genotype C, is able to neutralize the invasion of Vero cells by SARS-CoV BJ01 itself [29]. Further investigations are required to determine whether this serum could protect cells from invasion by SARS-CoV PUMC01 or other viruses of genotype T.

If a SARS-CoV epidemic appears again in the future, after rounds of random mutation, it might cause only slight clinical symptoms at the initial stage of the infection.
Through positive Darwinian selection, a virus with greater virulence might cause more severe clinical symptoms. A period of neu-

Table 4 Polymorphism and diversity analysis of the spike gene among the genotypes and the groups

Table 5 SARS fatality rate and genotypes by region
Region               Cases  Deaths  Fatality rate (%)  Genotype C genomes  Genotype T genomes
China (mainland)      5327     349                  7                  42                  11
  Beijing             2521     193                  8                  14                   5
  Guangdong           1512      58                  4                  26                   3
  Shanghai               8       2                 25                   2                   1
  Zhejiang               4       1                 25                   0                   1
  Hubei                  7       1                 14                   0                   1
Hong Kong             1755     299                 17                   3                  12
Singapore              238      33                 14                   0                  24
Taiwan                 346      37                 11                   0                  22
Canada                 251      43                 17                   0                   1
Germany                  9       0                  0                   0                   2
Italy                    4       0                  0                   0                   2
Russian Federation       1       0                  0                   0                   1
Thailand                 9       2                 22                   0                   1
Other                  156      11                  7                   0                   0
Total                 8096     774                 10                  45                  76

References
[1] Severe acute respiratory syndrome: global initiatives for disease diagnosis
[2] Coronavirus as a possible cause of severe acute respiratory syndrome
[3] Identification of a novel coronavirus in patients with severe acute respiratory syndrome
[4] A novel coronavirus associated with severe acute respiratory syndrome
[5] Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China
[6] SARS-beginning to understand a new virus
[7] SARS-associated coronavirus transmitted from human to pig
[8] The role of evolution in the emergence of infectious diseases
[9] Laboratory-acquired severe acute respiratory syndrome
[10] Severe Acute Respiratory Syndrome (SARS) in Taiwan, China
[11] SARS: one suspected case reported in China
[12] Review of probable and laboratory-confirmed SARS cases in southern China
[13] Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
[14] Severe acute respiratory syndrome-associated coronavirus genotype and its characterization
[15] Molecular phylogeny of coronaviruses including human SARS-CoV
[16] Phylogeny of SARS-CoV as inferred from complete genome comparison
[17] Coronavirus genomic-sequence variations and the epidemiology of the severe acute respiratory syndrome
[18] Molecular biological analysis of genotyping and phylogeny of severe acute respiratory syndrome associated coronavirus
[19] Molecular evolution of the coronavirus during the course of the SARS epidemic in China
[20] Ministry of Public Health. Ministry of public health announces the investigation result of epidemic of SARS happens in Beijing
[21] Mosaic evolution of the severe acute respiratory syndrome coronavirus
[22] Genomic characterisation of the severe acute respiratory syndrome coronavirus of Amoy Gardens outbreak in Hong Kong
[23] Molecular epidemiology of the novel coronavirus that causes severe acute respiratory syndrome
[24] Characterization of severe acute respiratory syndrome coronavirus genomes in Taiwan: molecular epidemiology and genome evolution
[25] Moderate mutation rate in the SARS coronavirus genome and its implications
[26] Isolation, identification and the variance of a coronavirus from an imputting SARS case. Chin
[27] Sequence analysis of the complete S gene of SARS-CoV isolated in Guangdong province
[28] Enteric involvement of severe acute respiratory syndrome-associated coronavirus infection
[29] Inactivated SARS-CoV vaccine prepared from whole virus induces a high level of neutralizing antibodies in BALB/c mice

Acknowledgments: We sincerely thank Charles Bernstein PhD, at Research in Medicine, University of Manitoba, Canada, for English assistance. We are grateful for the critical technical assistance provided by Niu Yu-Xin PhD, at James D. Watson Institute of Genome Sciences, Zhejiang University, China, and Zuyuan Xu PhD, at Research in Medicine, University of Manitoba, Canada.