key: cord-257022-6vw88jib authors: SHANG, Lei; QI, Yan; BAO, Qi-Yu; TIAN, Wei; XU, Jian-Cheng; FENG, Ming-Guang; YANG, Huan-Ming title: Polymorphism of SARS-CoV Genomes date: 2006-04-30 journal: Acta Genetica Sinica DOI: 10.1016/s0379-4172(06)60061-9 sha: doc_id: 257022 cord_uid: 6vw88jib
Abstract In this work, severe acute respiratory syndrome associated coronavirus (SARS-CoV) genome BJ202 (AY864806) was completely sequenced. The genome was obtained directly from the stool sample of a patient in Beijing. Comparative genomics methods were used to analyze the sequence variations of 116 SARS-CoV genomes (including BJ202) available in the NCBI GenBank. With the genome sequence of GZ02 as the reference, 41 polymorphic sites were identified in BJ202 and a total of 278 polymorphic sites were present in at least two of the 116 genomes. The polymorphic sites were unevenly distributed over the genome. About half of the variations (50.4%, 140/278) clustered in the one third of the genome at the 3′ end (19.0–29.7 kb). Regions encoding Orf10–11, Orf3/4, E, M and S protein had the highest mutation rates. A total of 15 PCR products (about 6.0 kb of the genome), including 11 fragments containing 12 known polymorphic sites and 4 fragments without identified polymorphic sites, were cloned and sequenced. The results showed that 3 unique polymorphic sites of BJ202 (positions 13 804, 15 031 and 20 792), along with 3 other polymorphic sites (26 428, 26 477 and 27 243), all contained 2 kinds of nucleotides. Interestingly, position 18 379, which has not been identified as polymorphic in any of the other 115 published SARS-CoV genomes, is actually a polymorphic site. The nucleotide composition of this site is A (8) to G (6). Among the 116 SARS-CoV genomes, 18 types of deletions and 2 insertions were identified. Most of them were related to a 300 bp region (27 700–28 000) which encodes parts of the putative ORF9 and ORF10–11.
A phylogenetic tree illustrating the divergence of the whole BJ202 genome from the 115 other completely sequenced SARS-CoVs was also constructed. BJ202 was phylogenetically closest to BJ01 and LLJ-2004.
Severe acute respiratory syndrome (SARS) is a new infectious disease that first emerged in Guangdong province, China, in November 2002 and then quickly spread worldwide before being successfully controlled in 2003 by classical public health measures [1]. It was characterized by high morbidity and mortality. Within six months of the first outbreak, it affected 8 096 people and led to 774 deaths [2]. Since the publication of the first complete genomic sequence of SARS-CoV, 115 SARS-CoV genomic sequences have been completed and hundreds of additional partial sequences are available in the NCBI GenBank, all of which provide a strong foundation for a better understanding of the transmission and molecular evolution of SARS-CoV. However, it is still a great challenge to establish the relationship between the observed genomic variations and the biology of SARS-CoV.
SARS-CoV, an enveloped, positive-stranded RNA virus, was determined to be a new member of the Coronaviridae family, and its transmission is supposed to have been from wild animals to human beings [4]. It has the largest genome size among RNA viruses and a broad host range. The entire 29 700-base genome of SARS-CoV contains 12 putative open reading frames, including those for four major structural proteins: the spike protein (S protein, 1 255 aa), which mediates attachment to cellular receptors and entry by fusion with cell membranes; the small envelope protein (E protein, 76 aa), which acts as a scaffold protein to trigger assembly; the membrane protein (M protein, 221 aa), which is an integral membrane protein involved in budding and interaction with the nucleocapsid and S proteins; and the nucleocapsid protein (N protein, 422 aa) [3,7-10].
The functions of some non-structural proteins, such as the polyproteins of the replicase complex encoded by Orf1a and Orf1b, have also been identified.
A characteristic of RNA viruses is their genetic instability, which enables them to escape attack by the host immune system and to change their host range and tissue tropism more frequently. On the other hand, over a long epidemic period, the rate of synonymous mutations in the coding sequences of SARS-CoV was constant, while the rate of non-synonymous mutations (amino acid substitutions) decreased. Pairwise analysis of Ka/Ks for the genotypes in each epidemic phase showed that the average Ka/Ks for the early phase was significantly higher than that for the middle phase, and higher still than that for the later phase.
In this work, the genome of one SARS-CoV obtained directly from the stool sample of a SARS patient was completely sequenced. Comparative genomics analysis was performed to reveal the biological characteristics of the SARS-CoV genome.
A male patient was hospitalized in Beijing Youan Hospital on April 29, 2003, a week after the onset of the respiratory disease. Based on clinical characteristics that satisfied the WHO case definitions (http://www.who.int/csr/sars/casedefinition/en/), he was diagnosed as a SARS patient. One stool sample for direct SARS-CoV genome sequencing was collected on May 23, 2003 (31 d after onset of the disease), from which the SARS-CoV BJ202 genome sequence was completed.
About 0.2 g of the stool sample was placed in a 1.5 mL tube, and 1 mL of 0.8% NaCl was added. The tube was vortexed to suspend the stool sample thoroughly and was then centrifuged for 3 min at 5 000 r/min. About 140 µL of the supernatant was collected to extract viral RNA with an RNA extraction kit (Qiagen). The RNA was dissolved in 100 µL diethyl pyrocarbonate (DEPC)-treated water containing 1 U DNase I (Promega).
cDNA was synthesized by reverse transcription from 10 µL RNA at 45°C.
PCR products containing polymorphic sites were ligated into the T-vector (Promega). About 20 recombinant clones from each ligation were isolated and sequenced. Only the high-quality nucleotides (Phred/Phrap/Consed, >Q40) at the polymorphic sites were counted.
The PCR products were used for direct sequencing analysis on ABI 377 sequencers (Applied Biosystems) and MegaBACE 1000 (Amersham). We used the Phred/Phrap/Consed package version 13.0 for processing all of the raw sequence data. Base calling was performed by Phred (http://www.phrap.org). Contamination from human and other sources was removed by CrossMatch, and the complete sequence was assembled using Phrap (http://www.phrap.org). The gaps, as well as the regions with low-quality data identified after the preliminary assembly, were filled in or refined by re-sequencing the PCR products.
1.5 Annotation and comparative genome analysis
The 115 complete SARS-CoV genome sequences for comparative analysis were retrieved from NCBI (Table 1). All the open reading frames were identified with ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Comparative analysis was performed using BLAST against the nr (non-redundant) database.
The complete genome sequence of BJ202 was 29 751 bp in length. Compared with the genome sequence of GZ02, it had a deletion of 29 bp corresponding to the region of 27 884-27 912 bp (Orf10-11) in the GZ02 genome. With GZ02 as the reference, 41 polymorphic sites were identified in BJ202 (Table 2). Interestingly, a unique polymorphic site, 15 031, was found in BJ202 in a region with the lowest mutation frequency (Fig. 1). Except for the polyA region of the 3′ end, no other deletions or insertions were identified in the BJ202 genome.
Comparative analysis between BJ202 and GZ02
We aligned 116 complete genome sequences of SARS-CoV (including BJ202) to analyze their single nucleotide polymorphisms (SNPs).
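The site-counting rule stated in the abstract (a site is counted when it is polymorphic in at least two of the aligned genomes, relative to the GZ02 reference) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the toy sequences, and the reading of the rule as "the same non-reference nucleotide occurs in at least two genomes" are assumptions made for illustration, and the input is assumed to be an upper-case, gap-padded alignment in which every sequence has the same length as the reference.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def polymorphic_sites(reference, genomes, min_genomes=2):
    """Return 1-based alignment positions where one non-reference
    nucleotide is carried by at least `min_genomes` aligned genomes.

    Gaps and ambiguous symbols are ignored, so deletions are not
    counted as substitutions (an assumption of this sketch).
    """
    sites = []
    for index, ref_base in enumerate(reference):
        variants = Counter(
            genome[index]
            for genome in genomes
            if genome[index] in BASES and genome[index] != ref_base
        )
        if variants and max(variants.values()) >= min_genomes:
            sites.append(index + 1)
    return sites


def sites_per_kb(sites, start, end):
    """Density of polymorphic sites in the inclusive region start..end (bp)."""
    count = sum(1 for site in sites if start <= site <= end)
    return count / ((end - start + 1) / 1000)


# Hypothetical 10-column toy alignment, not real SARS-CoV sequence.
toy_reference = "ACGTACGTAC"
toy_genomes = ["ACGTACGTAC", "ACTTACGTAC", "ACTTACGAAC", "AC-TACGTAC"]
toy_sites = polymorphic_sites(toy_reference, toy_genomes)
```

In the toy alignment only column 3 qualifies, because the substitution in column 8 is carried by a single genome and the gap in the last sequence is skipped. The `sites_per_kb` helper expresses a site list as sites per kb for a chosen region, which is the unit used for the regional comparisons in this paper.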
There were 278 polymorphic sites present in at least two of the 116 genomes. The region encoding Orf10-11 had the highest mutation frequency. The region encoding Orf3/4, E protein and M protein had the second highest mutation frequency, in that about 5.4% (1.6/29.7 kb) of the sequence contained 12.2% (34/278) of the polymorphic sites. The average mutation rate in this region was up to 21.3 per kb (34/1.6 kb), more than 2 times that of the whole genome. The 21.9-23.9 kb region, which falls into OrfS, had the third highest mutation frequency, with 39 polymorphic sites found in the nearly 2 kb stretch of genomic sequence (19.5 SNPs per kb). The other two thirds of the SARS-CoV genomic sequence had a lower density of polymorphic sites, except for the 6.9-9.9 kb region within Orf1a, which contained 32 polymorphic sites in the 3 kb sequence (or 10.7 polymorphic sites per kb). The region of 14.4-17.3 kb had the lowest mutation frequency of the whole genome. Only 6 polymorphic sites were scattered over nearly 3 kb of this part of the genome (Fig. 1).
We cloned 15 PCR products (about 6.0 kb of the genome), including 11 fragments containing 12 known polymorphic sites and 4 fragments without identified polymorphic sites. The results showed that all 3 unique polymorphic sites of BJ202 (positions 13 804, 15 031 and 20 792) contained 2 kinds of nucleotides. Polymorphic sites 26 428, 26 477 and 27 243 also contained mixed nucleotides. Interestingly, position 18 379, which has not been identified as polymorphic in any of the other 115 published SARS-CoV genomes, is a polymorphic site. The nucleotide composition of this site is A (8) to G (6) (Table 3).
Among the 116 SARS-CoV genomes, 18 types of deletions and 2 insertions were identified. Most of them were related to a 300 bp region (27 700-28 000 bp of the genome) which encodes part of the putative ORF9 and ORF10-11. Eighty-six genomes had the 29 bp deletion. Some genomes were free of any deletion.
They were mainly genomes of SARS-CoV isolated from early-phase clinical samples or from animals (Table 4).
Notes to Table 3: clone counts are given as the number of clones with the same nucleotide as GZ02 to the number of clones with the polymorphic nucleotide; some sites are unique polymorphic sites in the BJ202 genome; one site was not identified as polymorphic in any of the other 115 SARS-CoV genomes but is polymorphic in the genome of BJ202, where the nucleotide ratio of A to G is 8 to 6. Note to Table 4: GZ-C genome with 3 deletions.
A phylogenetic tree based on the divergence of the whole genome of BJ202 from the other 115 completed SARS-CoVs places BJ202 closest to BJ01 (Fig. 2).
To date, complete genomic sequences of 115 SARS-CoVs are available in the NCBI GenBank, which provides a foundation for a better understanding of the polymorphism and molecular evolution of SARS-CoV. As an RNA virus, SARS-CoV has a higher genomic mutation rate than DNA viruses. Using the GZ02 genome as the reference, forty-one polymorphic sites have been identified in the BJ202 genome. Like the polymorphic sites identified in the other 115 SARS-CoV genomes, they are not evenly distributed over the whole genome. Four regions (8.5-10.1, 19.8-21.0, 22.1-22.6 and 25.7-26.6 kb), which cover only 14.1% (4.2/29.7 kb) of the whole genome, contain 56.1% of all the polymorphic sites (23/41; 9, 4, 5 and 5 polymorphic sites in the respective regions). These regions encode parts of Orf1a, among others.
In contrast, other regions of the SARS-CoV genome are highly conserved [12]. The 14.4-17.3 kb region has the lowest mutation frequency. Only 6 polymorphic sites are scattered over this part of the genome, which is nearly 3 kb long. This region, within Orf1b, encodes part of the RdRp (NSP9, 13 379-16 147) and HEL (NSP10, 16 148-17 950). NSP9 is a non-structural protein and has RNA-dependent RNA polymerase activity [13].
To verify the polymorphic sites in BJ202, we sequenced some cloned PCR products.
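The clone-based check introduced here can be sketched in a few lines of Python: for one genome position, count the nucleotide seen in each sequenced clone, keep only base calls above the Q40 cutoff given in the methods, and treat the site as mixed when more than one nucleotide remains. This is a hedged illustration rather than the authors' procedure. The function names, the `min_clones` parameter, and the quality values in the toy input are hypothetical; only the Q40 cutoff and the 8 A to 6 G split at position 18 379 come from the text.

```python
from collections import Counter

Q_MIN = 40  # quality cutoff taken from the text (>Q40)
BASES = {"A", "C", "G", "T"}


def clone_allele_counts(calls, q_min=Q_MIN):
    """Count nucleotides at one site across clones.

    `calls` is an iterable of (base, phred_quality) pairs, one per clone.
    Calls at or below `q_min`, and symbols other than A/C/G/T, are skipped.
    """
    return Counter(
        base.upper()
        for base, quality in calls
        if quality > q_min and base.upper() in BASES
    )


def is_mixed(counts, min_clones=1):
    """True when at least two different nucleotides are each supported
    by `min_clones` or more clones (`min_clones` is an assumed knob)."""
    supported = [base for base, n in counts.items() if n >= min_clones]
    return len(supported) >= 2


# Toy input for position 18 379: the 8 A / 6 G split is from the text;
# the quality values and the extra low-quality clone are invented.
calls_18379 = [("A", 50)] * 8 + [("G", 45)] * 6 + [("G", 20)]
counts_18379 = clone_allele_counts(calls_18379)
```

Raising `min_clones` above 1 is one way to avoid calling a site mixed on the strength of a single clone, which could reflect a PCR or cloning error rather than a genuine variant. The text does not state whether such a threshold was applied.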
Interestingly, the 3 unique polymorphic sites (positions 13 804, 15 031 and 20 792) in the BJ202 genome were all composed of 2 different nucleotides. Positions 26 428, 26 477 and 27 243, which are not unique to the BJ202 genome, were also composed of mixed nucleotides. Position 18 379 was actually polymorphic, although it had not been identified as such in any of the other SARS-CoV genomes. Among the 14 clones sequenced, 8 had nucleotide A and 6 had nucleotide G at this position. These results verify the polymorphic nature of the SARS-CoV genome, as previously reported. They also warn us that direct sequencing of PCR products may fail to identify many polymorphic sites.
Mapping the deletions and insertions in the SARS-CoV genome has been used to analyze genotype groups and the molecular evolution of SARS-CoV. To date, about 18 kinds of deletions and 2 insertions, ranging from 2 to 578 bp in length, have been identified in the 116 SARS-CoV genomes including BJ202. The deletions are located in Orf9, Orf1a, Orf1, OrfN and, mostly, in the region encoding Orf10-11. As discussed above, the region encoding Orf10-11 also has the highest frequency of sequence variations. Of the two insertions, one (10 bp) is in Orf1b of the GD69 genome and the other (6 bp) is in Orf9 of the LLJ-2004 genome. Only early-phase clinical SARS-CoVs (such as GD01 and GZ02) and SARS-CoVs of animal origin (such as SZ3 and SZ16) are free of deletions. This might mean that Orf10-11 is related to the host range or tissue tropism of SARS-CoV in animals and is not a contributing factor in the infection of human beings.
(Fig. 2: the tree was constructed using the number of nucleotide differences. Bootstrap = 1 000.)
Most SARS-CoV genomic sequences in the NCBI GenBank were from viral isolates of Vero E6 cell culture [3,7,16]. Only a few of them were obtained directly from clinical samples [5,17].
Although it was reported that the in vitro mutation rate of SARS-CoV during Vero cell passage was negligible [18], there might be differences between genomic sequences obtained directly from clinical samples and those obtained from cell culture isolates. As this work shows, when SARS-CoVs of different genotypes are mixed in the same sample, only the dominant genotypes are identified by the strategy of sequencing the RT-PCR products. Moreover, when a clinical sample is subjected to cell culture, viruses with different genotypes replicate at different rates. Those with dominant genotypes in the sample might not be the ones that replicate most and would be diluted by their counterparts during cell culture. Consequently, it is difficult to effectively identify the polymorphic sites induced in vivo by sequencing the RT-PCR products of cell culture isolates.
As an RNA virus, SARS-CoV has a higher mutation rate than DNA viruses. However, most genomic sequences of SARS-CoV available in the NCBI GenBank were derived from viral isolates of cell cultures. This might cause problems when the sequence variations of SARS-CoV are analyzed, because sequences from cell culture isolates might differ from those in the original clinical sample. In this work, we completed the sequencing of a SARS-CoV genome directly from a stool sample and analyzed the polymorphism of the SARS-CoV genome. These results provide valuable experimental data for further research into the mechanism of SARS-CoV mutation.
World Health Organization.
Severe acute respiratory syndrome (SARS)
Severe acute respiratory syndrome: global initiatives for disease diagnosis
Isolation and characterization of viruses related to the SARS coronavirus from animals in Southern China
Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China
Characterization of a novel coronavirus associated with severe acute respiratory syndrome
A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01)
Characterization of severe acute respiratory syndrome coronavirus genomes in Taiwan: Molecular epidemiology and genome evolution
Characterization of severe acute respiratory syndrome-associated coronavirus (SARS-CoV) spike glycoprotein-mediated viral entry
Initial analysis of complete genome sequences of SARS coronavirus
Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
The R protein of SARS-CoV: analyses of structure and function based on four complete genome sequences of isolates BJ01-BJ04
Molecular model of SARS coronavirus polymerase: implications for biochemical functions and drug design
Genetic variation analysis of SARS coronavirus
Bioinformatics analysis of SARS coronavirus genome polymorphism
Coronavirus genomic-sequence variations and the epidemiology of the severe acute respiratory syndrome
Genomic characterization of the severe acute respiratory syndrome coronavirus of Amoy Gardens outbreak in Hong Kong
Mutational dynamics of the SARS coronavirus in cell culture and human populations isolated in 2003
We are indebted to collaborators and clinicians from Hangzhou Genomics Institute, Beijing Genomics Institute and Beijing Youan Hospital.