key: cord-1050314-f7b39xj2 authors: Tang, Xiaolu; Ying, Ruochen; Yao, Xinmin; Li, Guanghao; Wu, Changcheng; Tang, Yiyuli; Li, Zhida; Kuang, Bishan; Wu, Feng; Chi, Changsheng; Du, Xiaoman; Qin, Yi; Gao, Shenghan; Hu, Songnian; Ma, Juncai; Liu, Tiangang; Pang, Xinghuo; Wang, Jianwei; Zhao, Guoping; Tan, Wenjie; Zhang, Yaping; Lu, Xuemei; Lu, Jian title: Evolutionary analysis and lineage designation of SARS-CoV-2 genomes date: 2021-02-06 journal: Sci Bull (Beijing) DOI: 10.1016/j.scib.2021.02.012 sha: f71b431cbfbe27dfbfc81229f945fd6804844c03 doc_id: 1050314 cord_uid: f7b39xj2 The pandemic due to the SARS-CoV-2 virus, the etiological agent of Coronavirus Disease 2019 (COVID-19), has caused immense global disruption. With the rapid accumulation of SARS-CoV-2 genome sequences, however, thousands of genomic variants of SARS-CoV-2 are now publicly available. To improve the tracing of the viral genomes’ evolution during the development of the pandemic, we analyzed single nucleotide variants (SNVs) in 121,618 high-quality SARS-CoV-2 genomes. We divided these viral genomes into two major lineages (L and S) based on variants at sites 8782 and 28144, and further divided the L lineage into two major sublineages (L1 and L2) using SNVs at sites 3037, 14408, and 23403. Subsequently, we categorized them into 130 sublineages (37 in S, 35 in L1, and 58 in L2) based on marker SNVs at 201 additional genomic sites. This lineage/sublineage designation system has a hierarchical structure and reflects the relatedness among the subclades of the major lineages. We also provide a companion website ( www.covid19evolution.net ) that allows users to visualize sublineage information and upload their own SARS-CoV-2 genomes for sublineage classification. Finally, we discussed the possible roles of compensatory mutations and natural selection during SARS-CoV-2’s evolution. These efforts will improve our understanding of the temporal and spatial dynamics of SARS-CoV-2’s genome evolution. 
As expected, the phylogenetic tree showed a clear delineation between the L and S lineages (Fig. 1). Consistent with previous observations [12-16, 19], S was more closely related to RaTG13 and GD Pangolin-CoV than L. The L lineage could be further divided into two major sublineages (L1 and L2) using three tightly linked SNVs (C3037U, C14408U, and A23403G). Specifically, out of the 9,698 L-lineage genomes in Fig. 1, 524 (5.4%) belonged to the L1 sublineage (C3037, C14408, and A23403), and 9,127 (94.1%) belonged to the L2 sublineage (U3037, U14408, and G23403). The remaining 47 (0.5%) could be assigned to neither the L1 nor the L2 sublineage.

In principle, recombination is invoked to explain the four haplotypes only when the recombination rate is substantially higher than the mutation rate, which might be violated for the SARS-CoV-2 viruses. The extent of non-random association of two variants in a given (Table 1). Likewise, the distinction of L1 and L2 in the phylogenetic tree (Fig. 1) is also congruent with the strong LD among sites 3037, 14408, and 23403 (Table 1). These observations inspired us to systematically identify the SNV pairs that were in significant LD. Specifically, we required a significant LD pair to meet three criteria: 1) r^2 ≥ 0.9, 2) LOD ≥ 150 (equivalent to P ≤ 10^-150), and 3) the minor allele frequencies of both sites were no less than 0.5% in at least one major clade (S, L1, or L2 see Table 2 for some examples; see Table S3 Table S5 online for details).

sublineages in S (based on 58 sites) (Fig. S6 online). The nomenclature of a sublineage was in the format of Lx or Sx, where x was an integer starting from 1. A sublineage could be divided into 2nd-tier subclades, each of which ended with a lower-case letter (e.g., L2b), which could be further divided into 3rd-tier haplotypes that ended with an integer (e.g., L2b5).
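The naming scheme can be illustrated with a small parser. This is a minimal sketch, not part of the published pipeline: the function name `parse_sublineage` is hypothetical, and it assumes only what the text states, namely that a name starts with L or S and that tiers alternate between an integer and a single lower-case letter (deeper tiers are assumed to follow the same alternation).

```python
import re


def parse_sublineage(name):
    """Return the nested labels from the major lineage down to `name`.

    Hypothetical helper: tiers are assumed to alternate
    integer, letter, integer, ... after the leading L or S.
    """
    if not name or name[0] not in "LS":
        raise ValueError("name must start with L or S: %r" % name)
    body = name[1:]
    tokens = re.findall(r"\d+|[a-z]", body)
    if "".join(tokens) != body:
        raise ValueError("unexpected characters in %r" % name)
    for index, token in enumerate(tokens):
        expect_integer = index % 2 == 0  # 1st, 3rd, 5th tiers are integers
        if token.isdigit() != expect_integer:
            raise ValueError("tiers do not alternate in %r" % name)
    path = [name[0]]
    for token in tokens:
        path.append(path[-1] + token)  # each tier extends its parent label
    return path


print(parse_sublineage("L2b5"))
```

Because each label is a prefix of its descendants, the parent of any sublineage can be read directly from its name, which is what makes the scheme hierarchical.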
Occasionally, a 3rd-tier sublineage was divided into 4th-tier subclades (ending with a lower-case letter, e.g., L2b5d), 5th-tier haplotypes (ending with an integer, e.g., L2b5d2), and even 6th-tier sublineages (ending with a lower-case letter, e.g., L2b5d2b). Thus, our nomenclature system, which was based on nested or high-frequency marker SNVs, was hierarchical. The S lineage was divided into ten sublineages that were termed S1-S10. Further

out of the total analyzed genomes) were labeled with * (asterisk) symbols to represent uncertain membership of a subclade in a given clade. For instance, S* belonged to the S lineage but did not fall into S1-S10; S1b* belonged to S1b but did not fall into any subclades in S1b (i.e., S1b1, S1b2, or S1b3). Generally, the * strains were few in number in the viral population, presumably due to mutations at the marker SNV sites, or due to sequencing errors. It is also plausible that some * strains may represent the transitional stages between two sublineages during viral evolution, but they were under-represented in the GISAID dataset

The haplotype network analysis is powerful for tracing viral genealogies when both the ancestral and descendant samples are analyzed [23, 24]. To trace the evolutionary trends of the SARS-CoV-2 genomes, we reconstructed the haplotype networks of the sublineages using all 206 marker SNV sites (Fig. 4)

As expected, the network analysis showed a distinct separation between the L and S lineages, as well as the delineation between L1 and L2 sublineages. Within S, the sublineages S3-S10 were likely derived from S2 (Fig. 4). Within L, the separation between L1 and L2 lineages was also clearly shown in the network analysis results, with L1 and L2 designated as the ancestral and derived forms, respectively. L1a and L2a were inferred to be the ancestral forms in L1 and L2, respectively.
The haplotype network analysis, therefore, provided important insights into the genealogies of the SARS-CoV-2 genomes.

Within L2, the subclade L2d6 was characterized by the C28854U variant. However, as a prerequisite, only strains that simultaneously carried both groups of linked variants (U3037, U14408, and G23403, which defined L2, and A28881, A28882, and C28883, which further defined L2d) were examined for the C28854U variant for L2d6 designation. Although recurrent variants at site 28854 might be common in lineages other than L2d6, they would have a very limited effect on the designation of L2d6. Therefore, our lineage nomenclature system is hierarchical and robust to individual recurrent variants.

The sublineages exhibited substantial differences spatially and temporally. Our analysis showed that adaptive evolution is likely to drive certain sublineages, such as L2, to increase

061 non-redundant genomes for tree reconstruction. The genome sequences of bat coronavirus RaTG13 (GenBank accession number: MN996532) and GD Pangolin-CoV (the SARS-CoV-2-related viruses in Malayan pangolin samples obtained by anti merged from GISAID: EPI_ISL_410544 and Genome Warehouse: GWHABKW00000000 as previously described [12]) were sequentially added as outgroups by MAFFT GTR+G -B 1000) [33] was used to construct the maximum likelihood phylogenetic tree. Interactive Tree Of Life (iTOL)

The L lineage was further divided into L1 and L2 major sublineages by three tightly linked genomic variants (C3037U, C14408U, and A23403G)

Occasionally, an LD pair in a certain sublineage was not detected in the global population analysis because the variants had very low frequencies and were neglected by Haploview, or because the LD detection was confounded by recurrent mutations.
Therefore, to recover the LD pairs that were significantly linked in a sublineage but were not detected in the global analysis, we analyzed the LD patterns between SNVs in the S, L1, and L2 clades separately, in addition to the global viral population

We first inferred the ancestral states of the 206 marker SNV sites in SARS-CoV-2 using SARS-CoV-2 and coronaviruses in bats and pangolins, which was recently evaluated with molecular evolution simulations [37]. In addition, the 44-way whole-genome sequence alignments of SARS-CoV-2 and bat coronaviruses in UCSC

Occasionally, the ancestral state of an SNV site could not be unambiguously inferred based on the nucleotides in the outgroups

Overall, among the 206 marker SNV sites the nucleotides of the reference genome (NC_045512) were inferred to be ancestral at the other 204 sites (see Fig. S1 online for details). Of note

Although the U29095 variant was observed in the orthologous sites of many SARS-CoV-2 related coronaviruses

Hence, we inferred the reference allele C29095 to be the ancestral one, and recurrent

For each of the 130 sublineages, the major haplotype sequence was inferred for the 206

The nucleotides in the 206 orthologous sites of RaTG13 were used to root the haplotype network. DnaSP v6.12.03 [38] was used to generate the haplotype data format, and PopART v1.7 [39] was used to draw haplotype networks.
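The idea of a "major haplotype" per sublineage, and of counting the marker sites at which two such haplotypes differ, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation (the text names DnaSP and PopART for the actual analysis): each genome is assumed to be reduced to a string of nucleotides at the marker sites, the major haplotype is taken to be the most frequent such string, and the function names and toy three-site data are hypothetical.

```python
from collections import Counter


def major_haplotype(haplotypes):
    """Most frequent marker-site string among the genomes of one sublineage."""
    if not haplotypes:
        raise ValueError("no haplotypes given")
    return Counter(haplotypes).most_common(1)[0][0]


def n_differences(hap_a, hap_b):
    """Number of marker sites at which two haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same marker sites")
    return sum(1 for x, y in zip(hap_a, hap_b) if x != y)


# Toy data: three marker sites only (the study used 206).
clade_one = ["CCA", "CCA", "CCA", "UCA"]
clade_two = ["UUG", "UUG", "UUA"]

rep_one = major_haplotype(clade_one)
rep_two = major_haplotype(clade_two)
print(rep_one, rep_two, n_differences(rep_one, rep_two))
```

Using one representative string per sublineage keeps the network small; rare within-sublineage variants such as "UCA" above do not enter the comparison.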
The haplotype network was inferred with the TCS Networks

We extracted the detailed information of the high-quality SARS-CoV-2 genomes (the

We downloaded 217,305 SARS-CoV-2 genomes from the GISAID database

An intense sampling of viruses in a specific location or during a short period of time would cause an excess of highly similar viral genomes in the GISAID database, and potentially lead to biased estimations of the global frequencies of variants

Among the 121,618 genomes, we identified 29,091 SNVs at 20,487 genomic sites after trimming the 5' (1-220) and 3 ± 5 (mean ± standard deviation; ranging from 0 to 198). Of these identified SNV sites 5%) were multi-allelic. The majority of these SNVs had very low minor allele frequencies (MAF), including 9,839 singletons (33.8%)

Previous studies used bat coronavirus RaTG13 or pangolin coronavirus to root

At that time, the accuracy of such ancestral inferences remained uncertain. Recently, molecular evolution simulations have demonstrated that using these animal coronaviruses as outgroups can yield an accuracy of > 95.98% for inferring the ancestral state for a variant of SARS-CoV-2

Since phylogeny reconstruction with all the genomes is computationally challenging, we clustered the 121,618 high-quality genomes with a sequence identity cutoff of 99.9%. This yielded 10,061 non-redundant genomes that were used for tree reconstruction

0%) out of the 121,618 non-redundant genomes had detailed date information for virus isolation. When the lineages were traced over time, we found that the L lineage kept increasing as the pandemic progressed (Fig. 5a)

The frequency of L2c (characterized by U15324) became higher from November 2020, although the overall frequency of L2c was still relatively low (Fig.
5a, see Fig

The lineages were strongly biased in spatial distributions due to high rates of strain isolation and sequencing in some locations as compared to others

and their frequencies alternated (Fig. 5e). In South America and Africa, where the numbers of genomes in GISAID were relatively small (n = 1307 and 2099, respectively), L2d seemed to be the dominant sublineage (Fig. 5f, g)

we presented the detailed numbers of genomes in each lineage/sublineage at both global and continental levels. A more detailed distribution of the viral lineages and sublineages is shown on our user-friendly website

The frequency of a viral variant can fluctuate temporally or spatially due to sampling bias (i.e., the founder effect

Nevertheless, the frequencies of some viral variants may have changed due to the transmission or pathogenicity of the virus. For example, the A82V amino acid change in the glycoprotein of the Ebola virus spread rapidly during the 2013-2016 Ebola outbreak, and the frequency of this variant eventually reached over 90% among all sequenced Ebola genomes

Several recent studies have shown that some variants of SARS-CoV-2 might be associated with viral transmissibility

This is consistent with a recent study that reported that the G614 variant was driven by adaptive evolution [60]. The pattern has been observed in multiple regions (Figs. 5 and S11a online), indicating that adaptive evolution might be an important force driving the prevalence of the L2 sublineage

Ab, aB, and ab; A and B are ancestral, and a and b are derived at the two sites, as described above), one

to neutral expectation at the global scale (Fig. 2a). One hypothesis to account for

Specifically, under the multiple independent mutations model (Fig.
S2b online), both A→a and B→b mutations are deleterious, and hence both Ab and aB have lower fitness than AB

A→a in the Ab molecule, or B→b in the aB molecule) produces the ab haplotype with normal or even higher fitness. Similarly, epistasis can also cause ab to have a similar or even higher fitness than both Ab and aB in other scenarios such as recombination (Fig. S2a online) or recurrent mutations

For instance, for the 3037/14408/23403 linkage group which defined L1 and L2, 7,720 genomes carried the ancestral allele (C3037, C14408, and A23403), 108,833 genomes carried the triple-mutant allele (U3037, U14408, and G23403), while only 22-141 genomes carried the possible transitional haplotypes (Fig. 6a). In other words, the A23403G mutation

985 genomes carried the triple-mutant allele (A28881, A28882, and C28883), while only 1-107 genomes carried the possible transitional haplotypes (Fig. 6b)

More than three decades ago, Motoo Kimura proposed that compensatory neutral mutants (i.e., two mutations that are deleterious individually but jointly restore normal fitness)

patterns. Moreover, simultaneous mutations followed by reverse mutations S2e online) might explain the non-random associations between the variants. For instance, the A28881/A28882/C28883 variants likely resulted from one replication event and were then maintained by natural selection during evolution. Deciphering the effects of individual and

2 genome sequences, there are thousands of genetic variants of SARS-CoV-2 available for analysis. In this study, we designated SARS-CoV-2 lineages with 206 marker SNV sites, the majority of which were in strong LD.
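To make "strong LD" concrete, the pairwise statistics can be computed from the counts of the four haplotypes (AB, Ab, aB, ab). This is a minimal sketch, not the Haploview computation the authors used. It assumes haplotypes are observed directly (viral genomes are haploid), that the association measure in the paper's first criterion is r^2, and that LOD can be approximated by the log10 likelihood ratio of the observed haplotype frequencies against independence. The function names are hypothetical. The default thresholds are the three stated in the text (0.9, 150, 0.5%).

```python
import math


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (r2, lod) for two biallelic sites from haplotype counts."""
    counts = (n_AB, n_Ab, n_aB, n_ab)
    n = sum(counts)
    if n == 0:
        raise ValueError("no haplotypes observed")
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at the first site
    p_B = (n_AB + n_aB) / n  # frequency of allele B at the second site
    d = n_AB / n - p_A * p_B  # LD coefficient D
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    # Expected haplotype frequencies if the two sites were independent.
    expected = (p_A * p_B, p_A * (1 - p_B),
                (1 - p_A) * p_B, (1 - p_A) * (1 - p_B))
    lod = sum(c * math.log10((c / n) / e)
              for c, e in zip(counts, expected) if c > 0)
    return r2, lod


def is_significant(r2, lod, maf_1, maf_2,
                   min_r2=0.9, min_lod=150.0, min_maf=0.005):
    """Apply the three thresholds; MAFs are supplied per clade by the caller."""
    return r2 >= min_r2 and lod >= min_lod and min(maf_1, maf_2) >= min_maf


# Hypothetical counts: two sites that are almost perfectly linked.
r2, lod = ld_stats(9000, 5, 5, 990)
print(round(r2, 3), lod > 150)
```

When only AB and ab are observed, r^2 equals 1; the more genomes fall into the transitional Ab and aB classes, the further r^2 drops.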
Our nomenclature system of lineage and sublineage designation has a hierarchical structure and reflects the relative relatedness among the subclades of the major clades

Phylogenetic inferences are usually made under the assumption of hierarchical bifurcating trees (i.e., one lineage splits into two descendant lineages). However, the evolution of viruses often violates the bifurcating assumption and proceeds by multifurcation, especially in the presence of super-spreaders

For instance, although the phylogenetic analysis revealed the clear delineation between L and S lineages and the distinction between L1 and L2 clades (Fig. 1), we obtained very complicated results when we analyzed the phylogenetic relationships of the viruses within each of the three major clades (S, L1, or L2). As shown in Fig. S13a (online), on the phylogenetic tree of the S genomes

On the other hand, the network analysis revealed that within S1, other subclades radiated from S1b2a; and that S3-S10 sublineages radiated from S2 (Fig. S13b online). Similarly, the L1a genomes, which were inferred to be the ancestral form within the L1 clade, were scattered on the phylogenetic tree of the L1 genomes (Fig. S13c online; see Fig. S13d online for the haplotype network of L1 clade); and the L2a genomes clade, were also scattered on the phylogenetic tree of the L2 genomes (Fig. S12e online; see

A possible explanation to reconcile these discrepant results is that during the continuing evolution of the viral genomes, the SARS-CoV-2 viruses experienced multifurcating forms of evolution in each major clade

Since most of the marker SNVs used to designate the sublineages were in strong LD

Recurrent variants (homoplasies) are common in SARS-CoV-2 strains, although most such variants tend to have very low frequencies (usually < 1%) in SARS-CoV-2 populations [11, 63].
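The reason a hierarchical scheme tolerates recurrent variants can be shown with the L2d6 example from the text: the C28854U marker is only examined in genomes that already carry the L2 and L2d markers. This is a minimal sketch, not the classifier behind the companion website. The marker alleles are the ones quoted in the text; the `designate` function, the position-to-nucleotide dictionary input, and the restriction to these few tiers are assumptions made for illustration.

```python
# Marker alleles as quoted in the text (RNA alphabet, U instead of T).
L1_MARKERS = {3037: "C", 14408: "C", 23403: "A"}
L2_MARKERS = {3037: "U", 14408: "U", 23403: "G"}
L2D_MARKERS = {28881: "A", 28882: "A", 28883: "C"}
L2D6_MARKERS = {28854: "U"}


def carries(genotype, markers):
    """True if the genome has every marker allele at the listed positions."""
    return all(genotype.get(pos) == allele for pos, allele in markers.items())


def designate(genotype):
    """Deepest label among the tiers this sketch knows about.

    `genotype` maps genome position to nucleotide (hypothetical input format).
    The L/S tier and all other sublineages are deliberately left out.
    """
    if carries(genotype, L2_MARKERS):
        if carries(genotype, L2D_MARKERS):
            if carries(genotype, L2D6_MARKERS):
                return "L2d6"
            return "L2d"
        return "L2"
    if carries(genotype, L1_MARKERS):
        return "L1"
    return "unassigned"


# A genome with L1 markers and a recurrent U at site 28854.
recurrent = {3037: "C", 14408: "C", 23403: "A", 28854: "U"}
print(designate(recurrent))
```

Because site 28854 is never consulted outside L2d, an independent C28854U mutation elsewhere in the tree cannot produce a false L2d6 call.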
In this study, several marker SNVs

Yaping Zhang, and Wenjie Tan conceived the presented idea and supervised the project with input from Xinghuo Pang
Jian Lu analyzed the data and interpreted the results. Shenghan Gao, Songnian Hu, Juncai Ma, and Tiangang Liu contributed to data interpretation
Xiaoman Du, and Yi Qin developed the accompanying website with the supervision of

[1] A new coronavirus associated with human respiratory disease in China
[2] A novel coronavirus from patients with pneumonia in China
[3] Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study
[4] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[5] Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
[6] Viral metagenomics revealed Sendai virus and coronavirus infection of Malayan pangolins (Manis javanica)
[7] Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)?
[8] Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
[9] Data, disease and diplomacy: GISAID's innovative contribution to global health
[10] Global initiative on sharing all influenza data - from vision to reality
[11] Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
[12] On the origin and continuing evolution of SARS-CoV-2
[13] Genomic variations of SARS-CoV-2 suggest multiple outbreak sources of transmission
[14] Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2 / HCoV-19) using whole genomic data
[15] Phylogenetic analyses of the severe acute respiratory syndrome coronavirus 2 reflected the several routes of introduction to Taiwan, the United States
[16] Phylogenetic network analysis of SARS-CoV-2 genomes
[17] Mutations, recombination and insertion in the evolution of 2019-nCoV
[18] Exploring the coronavirus pandemic with the WashU Virus Genome Browser
[19] Viral and host factors related to the clinical outcome of COVID-19
[20] The impact of mutations in
SARS-CoV-2 spike on viral infectivity and antigenicity
[21] A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
[22] Nextstrain: real-time tracking of pathogen evolution
[23] Reconstructing disease outbreaks from genetic data: a graph approach
[24] Analysis of haplotype networks: The randomized minimum spanning tree method
[25] Phylogenetic analysis of SARS-CoV-2 data is difficult
[26] A nomenclature system for the tree of human Y-chromosomal binary haplogroups
[27] Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences
[28] Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation
[29] MAFFT multiple sequence alignment software version 7: improvements in performance and usability
[30] SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments
[31] The Sequence Alignment/Map format and SAMtools
[32] Search and clustering orders of magnitude faster than BLAST
[33] IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era
[34] Haploview: analysis and visualization of LD and haplotype maps
[35] The variant call format and VCFtools
[36] PLINK: a tool set for whole-genome association and population-based linkage analyses
[37] The use of SARS-CoV-2-related coronaviruses from bats and pangolins to polarize mutations in SARS-Cov-2
[38] DNA sequence polymorphism analysis of large data sets
[39] POPART: full-feature software for haplotype network construction
[40] TCS: a computer program to estimate gene genealogies
[41] Median-joining networks for inferring intraspecific phylogenies
[42] Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-CoV-2 infections unreliable
[43] Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary
[44] No evidence for distinct types in the evolution of SARS-CoV-2
[45] Explaining phylogenetic network analysis of SARS-CoV-2 genomes
[46] The causes and consequences of HIV evolution
[47] On
the founder effect in COVID-19 outbreaks: how many infected travelers may have started them all?
[48] Did a single amino acid change make Ebola virus more virulent?
[49] Ebola virus glycoprotein with increased infectivity dominated the 2013-2016 epidemic
[50] Human adaptation of Ebola virus during the West African outbreak
[51] Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
[52] Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo
[53] The Spike D614G mutation increases SARS-CoV-2 infection of multiple human cell types
[54] SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity
[55] Naturally mutated spike proteins of SARS-CoV-2 variants show differential levels of cell entry
[56] Structural and functional analysis of the D614G SARS-CoV-2 spike protein variant
[57] SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate
[58] Could the D614G substitution in the SARS-CoV-2 spike (S) protein be associated with higher COVID-19 mortality?
[59] SARS-CoV-2 genomic variations associated with mortality rate of COVID-19
[60] Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal and explainable AI models
[61] The role of compensatory neutral mutations in molecular evolution
[62] Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders
[63] Evidence for strong mutation bias towards, and selection against, U content in SARS-CoV-2: implications for vaccine design
[64] World Health Organization.
SARS-CoV-2 Variant - United Kingdom of Great Britain and Northern Ireland
[65] Investigation of novel SARS-CoV-2 variant: Variant of Concern
[66] Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations
[67] SARS-CoV-2 Variant Under Investigation 202012/01 has more than twofold replicative advantage
[68] Estimated transmissibility and severity of novel SARS-CoV-2 Variant of Concern 202012/01 in England

She received a bachelor's degree from Northwest A&F University in 2017. Her research interests include the evolution of viral genomes and the translational regulation of eukaryotes

She received a bachelor's degree from Zhejiang University in 2019. Her research interests include small RNA-mediated gene regulation and the evolution of viral genomes

He received his Ph.D. degree in Ecology and Evolution from the University of Chicago in 2008. His research interests include the mechanisms and evolutionary principles of post-transcriptional gene expression regulation, the genetic basis of adaptation

Pairwise LD analysis for the marker SNVs at sites 8782/28144 (S/L delineation) and sites 3037/14408/23403 (L1/L2 delineation)
LD was analyzed for the four pairs of sites in three datasets: 1) 10,061 genomes used for the construction of the phylogenetic tree; 2) 121,618 genomes obtained after redundancy filtering and quality control; 3) 202,679 genomes obtained after initial quality control. The LOD value was

For the pairs other than 8782/28144, the sizes of the haplotypes (the numbers of genomes) in a major clade were also given in parentheses. The inferred ancestral nucleotides are in black, and the derived variants are

The phylogenetic tree of 10,061 SARS-CoV-2 genomes.
The phylogenetic tree was rooted with the bat coronavirus RaTG13 and GD Pangolin-CoV (the SARS-CoV-2-related viruses in Malayan pangolin samples obtained by anti-smuggling operations by the

Note that S was clearly delineated from L, and L further separated into L1 and L2 major sublineages

The genomes that could not be assigned to S or L are in purple, and the L-lineage genomes that could not be assigned to L1 or L2 are in yellow

The normalized frequencies of the four haplotypes (namely AB, Ab, aB, and ab; A and B are the ancestral alleles, and a and b are the derived alleles) for the 202 significant LD pairs. Each dot represents the frequency of a certain haplotype for a pair, and the four haplotypes for an LD pair were connected with lines

In (a-d), a colored triangle represented a subclade lineage, and the width of the triangle was in scale to the number of the genomes in a clade. For a sublineage, the number of genomes, as well as its percentage in the major clades ((a) for all the genomes; (b-d) for S, L1, and L2, respectively), were given in parentheses. All the SNVs were in coding regions, and the derived alleles (nonsyn, red; syn, blue) labeled in each branch were shared by all the descendant subclades. Except sites 8782 and 28144, the nucleotides in the reference genome at all the other 204 marker SNV sites were inferred to be the ancestral states. All the variants were given in the ancestral/position/derived format. The detailed information for the SNVs that specifically define each lineage or sublineage was given in Table S6 (online)

The 206 marker SNVs were considered in the haplotype network analysis, and the major haplotype of each sublineage was used as the representative of that sublineage. The size of each sublineage was scaled to the number of genomes in that sublineage.
The number of variants (out of 206 sites) between two neighboring sublineages is labeled

orthologous sites were considered in the haplotype network analysis; 2) the haplotype network reflected the relative relatedness between the haplogroups but did not necessarily mean one haplotype directly evolved from the neighboring ancestral haplotype because some of the intermediate genomes might be missing in the genomes so far sequenced; and 3) an edge linking RaTG13 and the S7 node (distinct from S2 by the U29095 variant) was manually removed in the haplotype network because it was likely caused by a recurrent mutation on site 29095 in S7

Temporal and spatial distributions of the sublineages in the whole world (a) and individual continents (b-g) based on 119,168 genomes that had detailed date information. The number of genomes was summarized at a two-week interval, and the frequency of each sublineage

In each linkage group, the inferred ancestral nucleotides are in black, and the derived variants are in red. The haplotypes that carry the reference alleles are presented in a grey background. The numbers of the haplotypes and the possible evolutionary paths from the ancestral to derived via the transitional haplotypes in a clade are given.

Evolutionary analysis and lineage designation of SARS-CoV-2 genomes

Xiaolu Tang a,1, Ruochen Ying a,1, Xinmin Yao a,1, Guanghao Li b,c, Changcheng Wu a, Yiyuli Tang b, Zhida Li d, Bishan Kuang d, Feng Wu d