key: cord-0810050-bcxbtnpt authors: Bai, Yunmeng; Jiang, Dawei; Lon, Jerome R.; Chen, Xiaoshi; Hu, Meiling; Lin, Shudai; Chen, Zixi; Wang, Xiaoning; Meng, Yuhuan; Du, Hongli title: Comprehensive evolution and molecular characteristics of a large number of SARS-CoV-2 genomes reveal its epidemic trends date: 2020-08-28 journal: Int J Infect Dis DOI: 10.1016/j.ijid.2020.08.066 sha: cab19beee90255776e8021e6b95b72fef10de267 doc_id: 810050 cord_uid: bcxbtnpt Objectives To further reveal the phylogenetic evolution and molecular characteristics of the whole genome of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) based on a large number of genomes and provide a basis for the prevention and treatment of SARS-CoV-2. Methods Various evolution analysis methods were employed. Results The estimated ratio of the rates of non-synonymous to synonymous changes (Ka/Ks) of SARS-CoV-2 was 1.008 or 1.094 based on 622 or 3624 SARS-CoV-2 genomes and 9 key specific sites of highly linkage and four major haplotypes H1, H2, H3 and H4 were found. The results of Ka/Ks, detected population size and development trends of each major haplotype showed H3 and H4 subgroups were going through a purify evolution and almost disappeared after detection, indicating they might have existed for a long time. H1 and H2 subgroups were going through a near neutral or neutral evolution and globally increased with time, and the frequency of H1 was generally high in Europe and correlated to death rate (r>0.37), suggesting these two haplotypes might relate to infectivity or pathogenicity of SARS-CoV-2. Conclusions Several key specific sites and haplotypes related to infectivity or pathogenicity of SARS-CoV-2 as well as the possible earlier origin time and place of SARS-CoV-2 were indicated based on evolution and epidemiology of 16373 SARS-CoV-2 genomes. The global outbreak of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is currently and increasingly recognized as a serious, public health concern worldwide. To date, a total of seven types of coronaviruses that can infect humans have been found. Four coronaviruses of them could cause a cold, including hCoV-229E, hCoV-NL63, hCoV-OC43 and hCoV-HKU1, while the other three viruses usually cause mild to severe respiratory diseases, including Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV), Middle East Respiratory Syndrome Coronavirus (MERS-CoV) and SARS-CoV-2. Particularly, SARS-CoV-2 has showed a greater adaptation to the human host compared to the other coronaviruses and the other reference hosts (Dilucca, et al., 2020) . The total number of SARS-CoV and MERS-CoV infections are only 8069 and 2494 with reproduction number (R0) fluctuating from 2.5~3.9 and 0.3~0.8, respectively. However, SARS-CoV-2 has infected 2471136 people in 212 countries by 2020-04-22 (WHO, 2020) , with the basic R0 ranging from 1.4 to 6.49 (Liu, et al., 2020) . Among these three typical coronavirus, MERS-CoV has the highest death rate of 34.40%, SARS-CoV has the modest death rate of 9.59%, SARS-CoV-2 is about 6.99% and 7.69% death rate of the global and Wuhan. However, some European countries have a quite high death rate, such as Belgium, Italy, United Kingdom, Netherlands, Spain and France, which even reach 14.95%, 13.39%, 13.48%, 11.61%, 10.42% and 13.60%, respectively, according to the data from 2020-04-22. Except for the shortage of medical supplies and aging, it is not clear whether there is a virus mutation effect in these countries with such a significantly increased death rate. It has been reported that SARS-CoV-2 belongs to beta-coronavirus and is mainly transmitted by the respiratory tract, which belongs to the same subgenus (SarbeCoVirus) as SARS-CoV (Lu, et al., 2020) . Through analyzing the genome and structure of SARS-CoV-2, its receptor-binding domain (RBD) is found to bind with angiotensin-converting enzyme 2 (ACE2), which is also one of the receptors for binding SARS-CoV (Wrapp, et al., 2020) . Some previous genomic studies have shown that SARS-CoV-2 is similar to certain bat viruses (RaTG13, with the whole genome homology of 96.2% (Zhou, et al., 2020) ) and Malayan pangolins coronaviruses (GD/P1L and GDP2S, with the whole genome homology of 92.4% (Lam, et al., 2020) ). They have speculated several possible origins of SARS-CoV-2 based on its spike protein characteristics (cleavage sites or the RBD) (Lam, et al., 2020 , Zhang and Holmes, 2020 , Zhou, et al., 2020 . In particular, the RBD of SARS-CoV-2 exhibits 97.4% amino acid sequence similarity to that of Guangdong pangolin coronaviruses, even though it is most closely related to bat coronavirus RaTG13 at the whole genome level. However, it is not enough to present genomewide evolution by a single gene or local evolution of RBD, and whether bats or pangolins play an important role in the zoonotic origin of SARS-CoV-2 remains uncertain (Andersen, et al., 2020) . Since there is little known about SARS-CoV-2 epidemic trends, origins and whether it has significant variation that affects its phenotype, we integrated various evolution analysis methods to further reveal the phylogenetic evolution and molecular characteristics of the whole genome of SARS-CoV-2 based on a large number of genomes and to provide some basis for the prevention and treatment of SARS-CoV-2. J o u r n a l P r e -p r o o f The complete genome sequences of SARS-CoV-2 were downloaded from China National Center for Bioinformation (https://bigd.big.ac.cn/ncov/release_genome/) and GISAID (https://www.gisaid.org/) by 2020-03-22. The sequences were filtered out according to the following criteria: (1) sequences with ambiguous time; (2) sequences with low quality which contained the counts of unknown bases > 15 and degenerate bases > 50 (https://bigd.big.ac.cn/ncov/release_genome); (3) sequences with similarity of 100% were removed to unique one. Finally, 624 high quality genomes with precise collection time were selected and aligned using MAFFT v7 (Katoh and Standley, 2013) with automatic parameters. Besides, the genome sequences of 7 SARS-CoV and 475 MERS-CoV were also downloaded from the National Center for Biotechnological Information (NCBI) (https://www.ncbi.nlm.nih.gov/), and the MERS-CoV dataset which includes samples collected from both human and camel. In addition, for further exploring evolution and molecular characteristics of SARS-CoV-2 based on the larger amount of genomic data, we redownloaded validation datasets of the genome sequences from GISAID by 2020-04-06 and 2020-05-10 respectively. The average rates of non-synonymous (Ka), rates of synonymous (Ks) and ratio of the rates of nonsynonymous to synonymous changes (Ka/Ks) for all coding sequences were calculated using KaKs_Calculator v1.2 (Zhang, et al., 2006) , and the substitution rate and tMRCA were estimated using BEAST v2.6.2 (Bouckaert, et al., 2019) . The temporal signal with root-to-tip divergence was visualized J o u r n a l P r e -p r o o f in TempEst v1.5.3 (Rambaut, et al., 2016) using a Maximum Likelihood (ML) whole genome tree with bootstrap value as input. For SARS-CoV and SARS-CoV-2, we selected a strict molecular clock and Coalescent Exponential Population Model. For MERS-CoV, we selected a relaxed molecular clock and Birth Death Skyline Serial Cond Root Model. We used the tip dates and chose the HKY as the site substitution model in all these analyses. Markov Chain Monte Carlo (MCMC) chain length was set to 10,000,000 steps sampling after every 1000 steps. The output was examined in Tracer v1.6 (http://tree.bio.ed.ac.uk/software/tracer/). Each genome sequence was aligned to the reference genome (NC_045512.2) using bowtie2 (Langmead and Salzberg, 2012) with default parameters, and variants were called by samtools (sort; mpileup -gf) and bcftoots (call -vm). The merge Variant Call Format (VCF) files were created by bgzip and bcftools (merge --missing-to-ref) (Li, 2011 , Li, et al., 2009 ). After alignment of 624 SARS-CoV-2 genomes with high quality and manually deleted 2 highly divergent genomes (EPI_ISL_415710, EPI_ISL_414690) according to the firstly constructed phylogenetic tree, the aligned dataset of 622 sequences was phylogenetically analyzed. The Smart Model Selection (SMS) method was used to select GTR+G as the base substitution model (Lefort, et al., 2017) . PhyML 3.1 (Guindon, et al., 2010) and MEGA (Kumar, et al., 2018) were used to construct the no-root phylogenetic tree by the ML method with the bootstrap value of 100. The online tool iTOL (Letunic and Bork, 2019) was used to visualized the phylogenetic tree. The clusters were defined by the shape of phylogenetic tree. Information (ID, countries/regions and collection time) and variants (NC_045512.2 as reference genome) of each genome from each Cluster were extracted. The allele frequency and nucleotide divergency (pi) for each site in the virus population of each Cluster were measured by vcftools (Danecek, et al., 2011) . The Fst were also calculated by vcftools (Danecek, et al., 2011) to assess the diversity between the Clusters. Sites with high level of Fst together with different major allele in each Cluster were filtered as the specific sites. Principal Components Analysis (PCA) were analyzed by the GCTA v1.93.1beta (Yang, et al., 2011) with the specific sites and all SNV dataset. The linkage disequilibrium of the specific sites were analyzed by haploview (Barrett, et al., 2005) , and the statistics of the haplotype of the specific sites for each Cluster or country were used in-house perl script. The Templeton, Crandall and Sing (TCS) network is constructed using an agglomerative approach where clusters are progressively combined with one or more connecting edges (Cotten, et al., 2014) , and the minimum spanning network (MSN) networks contain all edges that appear in a minimum spanning tree (Leigh, et al., 2015) . Hence, in order to estimate genealogical relationships of haplotype groups, the phylogenetic networks were inferred by PopART package v1.7.2 (Leigh, et al., 2015) using TCS method and MSN respectively. The frequencies of specific sites for each country were calculated. The death rate was estimated with Total Deaths /Confirmed Cases based on the data from Johns Hopkins resources on 2020-05-12 (https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6).The J o u r n a l P r e -p r o o f correlation coefficient between death rate and frequencies of specific site or haplotype in different countries was calculated using Pearson method. We got a total of 1053 genomic sequences by 2020-03-22. According to the filter criteria, 37 sequences with ambiguous time, 314 with low quality and 78 with similarity of 100% were removed for further analysis. A total of 624 sequences were obtained to perform multiple sequences alignment. Two highly divergent sequences (EPI_ ISL_414690, EPI_ISL_415710) according to the firstly constructed phylogenetic tree were also filtered out (Table S1 ). The remaining 622 sequences were used to reconstruct a phylogenetic tree. In addition, a total of 3624 and 16373 genomic sequences were redownloaded by 2020-04-06 and 2020-05-10 respectively for further exploring the evolution and molecular characteristics of SARS-CoV-2. The average Ka/Ks for all the coding sequences of 622 genome sequences ranging from 2019-12-26 to 2020-03-18 was closer to 1 (1.008), indicating that the genome was going through a neutral evolution. We also reevaluated the Ka/Ks of SARS-CoV and MERS-CoV through the whole period, and found the ratio were smaller than SARS-CoV-2 (Table 1) . To estimate the more credible Ka/Ks for SARS-CoV-2, we recalculated it using redownloaded 3624 genome sequences ranging from 2019-12-26 to 2020-03-18. As a result, its average Ka/Ks was 1.094 (Table 1) , which was almost same with the above J o u r n a l P r e -p r o o f result. We assessed the temporal signal using TempEst v1.5.3 (Rambaut, et al., 2016) . All three datasets exhibited a positive correlation between root-to-tip divergence and sample collecting time ( Figure S1 ), so that they were suitable for molecular clock analysis in BEAST (Bouckaert, et al., 2019 , Rambaut, et al., 2016 . The substitution rate of SARS-CoV-2 genome was estimated to be 1.601 x 10 -3 (95% CI: 1.418 x 10 -3 ~ 1.796 x 10 -3 , Table 2, Figure S2A ) substitution/site/year, which was in the same order of magnitude as SARS-CoV and MERS-CoV. The tMRCA was inferred in late September, 2019 (95% CI: 2019-08-08 ~ 2019-10-26,Table 2, Figure S2B ), which was about two months before the early cases of SARS-CoV-2 (Huang, et al., 2020) . The no-root phylogenetic trees constructed by the ML method with PhyML 3.1 and MEGA were showed in Figure 1 and Figure S3 . According to the shape of phylogenetic trees, we divided 622 sequences into three clusters (Figure 1 ): Cluster 1 including 76 sequences mainly from North America, Cluster 2 including 367 sequences from all regions of the world, and Cluster 3 including 179 sequences mainly from Europe (Table S2 ). The Fst and population frequency of a total of 9 sites (NC_045512.2 as reference genome) were detected (Table 3, Table S3 ). Thereinto, three (C17747T, A17858G and C18060T) were the specific sites of Cluster 1, and four (C241T, C3037T, C14408T and A23403G) were the specific sites of Cluster 3. Notably, C241T was located in the 5'-UTR region and the others were located in coding regions (6 in ofr1ab gene, 1 in S gene and 1 in ORF8 gene). Five of them were missense variant, including C14408T, C17747T and A17858G in ofr1ab gene, A23403G in S gene, and T28144C in ORF8 gene. The PCA results showed that these 9 specific sites could clearly separate the three Clusters, while all SNV dataset could not clearly separate Cluster 1 and Cluster 2 ( Figure S4 ), which further suggested that these 9 specific sites were the key sites for separating the three Clusters. We found the 9 specific sites were highly linked based on 622 genome sequences (Figure 2A ), then we carried out a further linkage analysis using the 3624 genome sequences ( Figure 2B ). As a result, for the 3624 genome sequences, 3 specific sites in the Cluster 1 were almost complete linked, and haplotype CAC and TGT accounted for 98.65% of all the 3 site haplotypes. The same phenomenon was also found in 4 specific sites of Cluster 3, and haplotype CCCA and TTTG accounted for 97.68% of all the 4 site haplotypes. Intriguingly, the 9 specific sites were still highly linked, and four haplotypes, including TTCTCACGT (H1), CCCCCACAT (H2), CCTCTGTAC (H3) and CCTCCACAC (H4), accounted for 95.89% of all the 9 site haplotypes. Thereinto, H1 and H3 had completely different bases at the 9 specific sites. The frequencies of each site and major haplotype for each country were showed in Figure 3 and Table S4 . The data showed that the haplotype TTTG of the 4 specific sites in Cluster 3 has existed globally at present, and still exhibited high frequencies in most European countries but quite low in Asian countries. While the haplotype TGT of the 3 specific sites in Cluster 1 existed almost only in North America and Australia. For the 9 specific sites, most countries had only two or three major haplotypes, except America and Australia which had all of four major haplotypes with relative higher frequencies. All haplotypes of 9 specific sites for 3624 or 16373 genomes and the numbers of them were shown in Table S5 . Four major haplotypes H1, H2, H3 and H4, three minor haplotypes H5, H7 and H8 close to H1 and one minor haplotype H6 close to H3 were found in both datasets. The numbers of these haplotypes for 16373 genomes with clear collection date detected in each country in chronological J o u r n a l P r e -p r o o f order were shown in Figure 4A . As show in these results, H2 and H4 subgroups have existed for a long time (2019-12-24 to 2020-04-28), and the former was with a far greater detected population size. H3 subgroup almost disappeared after detection (2020-02-18 to 2020-04-28), while H1 subgroup was globally increasing with time (2020-02-18 to 2020-05-05), indicating that H1 subgroup has adapted to the human hosts, and was under an adaptive growth period worldwide. However, due to the nonrandom sampling on early phase (only patients with a recent travel to Wuhan were detected), some earlier cases of H3 may be lost, which could be indicated by the high proportion of H3 subgroup during 2020-02-18 to 2020-03-10. H3 and H4 subgroups had a lower Ka/Ks ratio (about 0.7~0.8) than that of H1 and H2 subgroups (about 1.1~1.3) in 3624 or 16373 genomes among the 4 major subgroups ( Table 4 ), suggesting that H3 and H4 subgroups might be going through a purifying evolution and have existed for a long time, while H1 and H2 subgroups might be going through a near neutral or neutral evolution, which were consistent with the above phenomenon that only H1 and H2 subgroups were spreading around the world over time. From the whole genome mutations in each major haplotype subgroup ( Figure 4B , Table S6 ), we found that except the 9 specific sites, there were no common mutations with frequencies more than 0.05 but between H2 and H4 subgroups. Phylogenetic networks were inferred with 697 mutations called from 3624 genomes datasets and the network structures of TCS and MSN were similar. The major haplotype subgroups H2 and H4 were in the middle of the network, while H1 and H3 were in the end nodes of the network ( Figure 5 ). According to the phylogenetic networks, we proposed four hypothesis: (1) the ancestral haplotypes evolved in four different directions to obtain H1, H2, H3 and H4 respectively, or evolved in two or more different directions to obtain two or more major haplotypes and then involved into the other major haplotype(s); (2) H2 or H4 evolved in the different directions respectively and finally generated H1 and H3; (3) H1 evolved in one direction to generate H2 and H4, and then evolved into H3; (4) H3 evolved in one direction to generate H4 and H2, and then evolved into H1. We cannot exclude the first hypothesis based on the present data, however, according to the Ka/Ks, detected population size and development trends in chronological order of each major haplotype subgroup, if there are evolutionary relationships among the four major haplotypes, we speculate that the most likely evolution hypothesis is that H3 and H4 are the earliest haplotypes, which is gradually eliminated with selection, while H2 is the transitional haplotype in the evolution process, and H1 may be the finally fixed haplotype. To explore the relationship between death rate and the 9 specific sites, we used Pearson method to calculate the correlation coefficient between death rate and frequency of each specific site or major haplotype in 17 countries with 3624 genomes at early stage. As a result, all r values of 241T, 3037T, 14408T, 23403G and haplotype TTTG and H1 were more than 0.4 ( Figure 6 , Table S4 ). We also evaluated the correlation coefficient with 16373 genomes in 30 countries, the r values of haplotype TTTG and H1 were still more than 0.37 (Table S4) . Integrated with their high frequencies in most European countries, these finding indicated that the 4 sites (C241T, C3037T, C14408T and A23403G) and haplotype TTTG and H1 might be related to the pathogenicity of SARS-CoV-2. To explore the relationship between infectivity and the 9 specific sites, we used population size of major haplotypes to deduce the possible specific sites related infectivity. We assumed that these major haplotypes in each country were subject to similar virus transmission and control patterns, while the population sizes of H1 and H2 subgroups were far greater than those of H3 and H4 subgroups ( Figure 4A , Table S4 ). Thus, the common different specific sites of H1 and H2 subgroups with H3 and H4 subgroups, C8782T and T28144C, might be related to the infectivity of SARS-CoV-2, and the virus with C8782 and T28144 might be more infectious than those virus with 8782T and 28144C. SARS-CoV-2 poses a great threat to the production, living and even survival of human beings . With the further outbreak of SARS-CoV-2 in the world, comprehensive understanding on the evolution and molecular characteristics of SARS-CoV-2 based on a large number of genome sequences will help us cope better with the challenges brought by SARS-CoV-2. Exploring evolution rate, tMRCA and phylogenetic tree of SARS-CoV-2 would help us better understand the virus (Yuen, et al., 2020) . The average Ka/Ks for all the coding sequences of 622 and 3624 SARS-CoV-2 genomes was 1.008 and 1.094, which was higher than those of SARS-CoV and MERS-CoV, indicating that the SARS-CoV-2 was going through a neutral evolution. Interestingly, we also found that the subgroups of different haplotypes (H1, H2, H3 and H4) seemed to undergo the different evolutionary patterns according to their Ka/Ks. The H3 subgroup was disappeared soon after the detection (2020-02-18 to 2020-04-28, Figure 4A ), while H1 subgroup was globally increasing with time. These characteristics of evolution and change should be considered in developing therapeutic drugs and vaccines. The tMRCA of SARS-CoV-2 was inferred in late September, 2019 (95% CI: 2019-08-28 ~ 2019-10-26), about two months before the early cases of SARS-CoV-2 (Huang, et al., 2020) . We also estimated tMRCA of SARS-CoV and MERS-CoV with the same methods, both were about 3 months later than the corresponding tMRCAs estimated by previous studies (Cotten, et al., 2014 , Zhao, et al., 2004 . A recent study used TreeDater method to estimate tMRCA for more than 7000 SARS-CoV-2 genomes and indicated the tMRCA of SARS-CoV-2 was around 2019-10-06 to 2019-12-11, which was in broad agreement with six previous studies all performed on no more than 120 early SARS-CoV-2 genomes with BEAST method (van Dorp, et al., 2020) . In our study, we chose the most common method BEAST to estimate tMRCA of SARS-CoV-2 based on 622 genomes and the results of SARS-CoV and MERS-CoV are consistent with the previous studies, indicating that the estimated tMRCA in the present study are reliable. A recent study clustered 160 SARS-CoV-2 whole-genome sequences into A, B and C groups through a phylogenetic network analysis by taking bat RaTG13 as a root (Forster, et al., 2020) . The clustering result was similar to our study: both the samples in Cluster A and our Cluster 1 were mainly from the United States; the samples in Cluster C and our Cluster 3 were mainly from European countries, while Cluster B and our Cluster 2 were mainly from China and the other regions. It is interesting that the markers C8782T and T28144C, which were discovered by Yu et al. (Yu, et al., 2020) , in Cluster B were also found in our study, but the other markers in Cluster A (T29095C) and Cluster C (G26144T) were not significantly in our study, which may be caused by different sample sizes and different constructing methods of phylogenetic tree. Based on the base substitution model, the ML method avoids the possible "long-branch attraction" problem in the maximum parsimony method and is faster than the Bayesian method (Holder and Lewis, 2003) , indicating that it could be used as a reliable method for phylogenetic analysis. Some studies used the genome of bat SARS-like-CoV , RaTG13 or MT019529 (https://bigd.big.ac.cn/ncov/tree) as the root of phylogenetic tree. Unfortunately, there is no obvious evidence showing that SARS-CoV-2 is from the bat coronavirus even the identity between SARS-CoV-2 and RaTG13 is up to 96.2% (Zhou, et al., 2020) . In our study, the tMRCA of SARS-CoV-2 was inferred in late September 2019, which indicated there might exist an earlier SARS-CoV-2 strain we didn't find. In the case of unclear source of SARS-CoV-2 and high homology of its genomes (> 99.9% homology), it may be inappropriate to identify the evolutionary characteristics inside the genomes by taking bat SARS-like-CoV, RaTG13 or MT019529 as a root. Therefore, the ML method was used in current study to construct a no-root tree to obtain the reliable clusters with different characteristics. Based on the no-root tree, we identified 9 specific sites of highly linkage that played a decisive role in the classification of clusters successfully. Among the 9 specific sites, 8 of them are located in coding regions (6 in ofr1ab gene, 1 in S gene and 1 in ORF8 gene), and 5 of them are missense variant, including C14408T, C17747T and A17858G in ofr1ab gene, A23403G in S gene, and C28144T in ORF8 gene. Our study found that the 4 specific sites (C241T, C3037T, C14408T and A23403G) in the Cluster 3 were almost complete linked, and the frequency of haplotype TTTG was generally high in European countries and correlated to death rate (r>0.37) based on 3624 or 16373 SARS-CoV-2 genomes, which provides a new perspective to the J o u r n a l P r e -p r o o f reasons of relatively high death rate in Europe, and provides a new opportunity in designing new vaccine and drug development of SARS-CoV-2. Two possible specific sites C8782T and T28144C related to the infectivity of SARS-CoV-2 were also revealed in the present study, which would provide basis for SARS-CoV-2 epidemiology. Among these genes with specific sites, Orf1ab is composed of two partially overlapping open reading frames (orf), orf1a and 1b. It is proteolytic cleaved into 16 nonstructural proteins (nsp), including nsp1 (suppress antiviral host response), nsp9 (RNA/DNA binding activity), nsp12 (RNA-dependent RNA polymerase), nsp13 (helicase) and others , indicating the vital role of it in transcription, replication, innate immune response and virulence (Graham, et al., 2008) . C14408T with high frequencies of T in European countries were located at nsp12 region, indicating that this missense variant might influence the role of RNA polymerase. Spike glycoprotein, the largest structural protein on the surface of coronaviruses, comprises of S1 and S2 subunits mediating binding the receptor on the host cell surface and fusing membranes, respectively (Li, 2016) . It has been reported that S protein of SARS-CoV-2 can bind ACE2 with higher binding affinity than that of SARS (Wrapp, et al., 2020 , Zhou, et al., 2020 . Recently, several in vitro studies (Daniloski, et al., 2020 , Hu, et al., 2020 posted on preprint showed that A23403G in S gene (D614G mutation in S protein) could promote virus entry into host cells and enhance the infectivity of host cells by 2~7 times, which have partially verified our conclusions. It seems to take a long time to finally fix mutations according to the mutation frequency of each subgroup. For example, H2 and H4 subgroups, which have been detected for more than four months from 2019-12-24 to 2020-05-05 ( Figure 4A) , have more mutations with higher frequencies, but the highest mutation frequency is only 0.486 at the position of 11083 ( Figure 4B , Table S6 ). From these phenomena, it can be inferred that it takes a long time for the specific sites of each major subgroup to be fixed, but it may be faster if the early population is smaller. In addition, there is also the possibility that an ancestor strain evolved in four directions by obtaining the specific mutations directly and produced the four current major haplotypes, so the evolution time for obtaining four major haplotypes may be shorter, which seems to be consistent with the phenomenon that four major haplotypes can be detected in two months ( Figure 4A ). However, if there exists the evolution relationship among the major haplotypes of SARS-CoV-2, it is difficult to complete the evolution among the four major haplotypes within two months (2019-12-24 to 2020-02-18, Figure 4A ) at the current evolution rate of each major haplotype population (Table 4 ). Therefore, we speculate that the transformation among the four major haplotypes may have been completed for a long time, which have not been detected. It is interesting that only United States and Australia among 29 countries have all of four major haplotypes with relative higher frequencies ( Figure 4A , Table S4 ), which indicated that the two countries are the most likely places where the virus appeared earlier based on the present data. The Ka/Ks ratio and tMRCA of SARS-CoV-2 indicated that SARS-CoV-2 might have completed the selection pressure of cross-host evolution in the early stage and be going through a neutral evolution at present. The 9 specific sites with highly linkage were found to play a decisive role in the classification of clusters. Several key specific sites and haplotypes related to infectivity or pathogenicity of SARS-CoV-2 as well as the possible earlier origin time and place of SARS-CoV-2 were indicated based on evolution and epidemiology of 16373 SARS-CoV-2 genomes. The relationship between the key J o u r n a l P r e -p r o o f The 622 sequences were clustered into three clusters: Cluster 1 were mainly from North America, Cluster 2 were from regions all over the world, and Cluster 3 were mainly from Europe. Tables Table 1 Statistics (Cotten, et al., 2014) The proximal origin of SARS-CoV-2 Haploview: analysis and visualization of LD and haplotype maps BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan Spread, circulation, and evolution of the Middle East respiratory syndrome coronavirus. mBio The variant call format and VCFtools The D614G mutation in SARS-CoV-2 Spike increases transduction of multiple human cell types Codon Usage and Phenotypic Divergences of SARS-CoV-2 Genes Phylogenetic network analysis of SARS-CoV-2 genomes SARS coronavirus replicase proteins in pathogenesis New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0 The origin, transmission and clinical therapies on coronavirus disease 2019 (COVID-19) outbreak -an update on the status Phylogeny estimation: traditional and Bayesian approaches The D614G mutation of SARS-CoV-2 spike protein enhances viral infectivity and decreases neutralization sensitivity to individual convalescent ser Clinical features of patients infected with 2019 novel coronavirus in Wuhan MAFFT multiple sequence alignment software version 7: improvements in performance and usability MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms Identifying SARS-CoV-2 related coronaviruses in Malayan Fast gapped-read alignment with Bowtie 2 SMS: Smart Model Selection in PhyML popart: full-feature software for haplotype network construction Interactive Tree Of Life (iTOL) v4: recent updates and new developments Structure, Function, and Evolution of Coronavirus Spike Proteins A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data The Sequence Alignment/Map format and SAMtools The reproductive number of COVID-19 is higher compared to SARS coronavirus Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen) Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation GCTA: a tool for genome-wide complex trait analysis Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2 / HCoV-19) using whole genomic data SARS-CoV-2 and COVID-19: The most important research questions The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity Origin and Evolution of the 2019 Novel Coronavirus A pneumonia outbreak associated with a new coronavirus of probable bat origin HD conceived the study. YM, YB, DJ, JL and XC carried out the data analysis and wrote the manuscript. MH, SL, and ZC collected data and revised the manuscript. XW attended the discussions.HD and YM supervised the whole work and revised the manuscript. The authors declare no conflict of interest. Not required. Electronic supplementary tables are available online at https://figshare.com/articles/dataset/SARS-CoV-2_genomes_evolution/12366449