key: cord-0863751-bvs8j5bc authors: Lin, Ziying; Qing, Hua; Li, Rui; Zheng, Lei; Yao, Huipeng title: Evolution trace of SARS‐CoV‐2 from January 19 to March 12, 2020, in the United States date: 2021-07-28 journal: J Med Virol DOI: 10.1002/jmv.27225 sha: 67b58c16cae7aed061604fb5f389333264f78b9f doc_id: 863751 cord_uid: bvs8j5bc

As a human betacoronavirus, SARS‐CoV‐2 has endangered public health globally. As of January 2021, the virus had caused 2,209,195 deaths. By studying the evolutionary trend and characteristics of 265 SARS‐CoV‐2 strains in the United States from January to March 2020, we found that the strains can be divided into six clades: USA clade‐1, USA clade‐2, USA clade‐3, USA clade‐4, USA clade‐5, and USA clade‐6. USA clade‐1 may be the most ancestral clade, USA clade‐2 is an interim clade between USA clade‐1 and USA clade‐3, the other three clades have similar codon usage patterns, and USA clade‐6 is the newest and most adaptable clade. Mismatch analysis and protein alignment showed that the evolution of the clades arose from particular mutations in viral proteins, which may help the strains invade host cells, replicate, transcribe, and so on. In light of previous research and classifications, we suggest that clade O in GISAID should not be an independent clade and that Wuhan‐Hu‐1 (EPI_ISL_402125) should not be used as an ancestral reference sequence. Our study decodes the evolutionary dynamics of SARS‐CoV‐2 in the early stage of the epidemic in the United States, which gives some clues for inferring the current evolutionary trend of SARS‐CoV‐2 and for studying the function of mutated viral proteins.

ORF1a 9 and blocking the activation of innate immunity by deubiquitination and deISGylation. [10] [11] [12] [13] After binding to ADP-ribose, Nsp3 can also remove this group from ADP-ribosylated proteins involved in the immune response. [14] [15] [16] In addition, together with Nsp4 and Nsp6, Nsp3 can induce double-membrane vesicles outside the viral replication-transcription complexes (RTC). 
17 Nsp5, the main protease, hydrolyzes the sequence (L/V/F)Q↓(S/A/G) and cleaves polyprotein 1ab to release 12 proteins, Nsp4 to Nsp10 and Nsp12 to Nsp16. 18 Nsp7 and Nsp8 form a heterodimer that binds the RNA-dependent RNA polymerase, stabilizing the polymerase domain, significantly increasing the affinity of the enzyme for RNA, and enhancing its catalytic activity. 4, 19, 20 Nsp9 is a single-stranded RNA-binding protein that may influence the virulence of the virus. 21 Nsp10 assists the methyltransferases in forming a methylation complex that caps the viral mRNA, and it enhances their enzymatic activity. 22 Nsp12 is the RNA-dependent RNA polymerase (RdRp), which catalyzes replication and transcription of the viral genome after forming the RTC with other Nsps. In addition, together with Nsp8, it can regulate the activity of the helicase. 23 Nsp13 is a helicase that participates in viral replication and preservation throughout the viral life cycle. 24 Nsp14 has an N-terminal exonuclease domain and a C-terminal guanine-N7 methyltransferase domain; the former proofreads newly misincorporated nucleotides, and the latter is involved in capping the viral RNA, which protects it from degradation by the host immune system. [25] [26] [27] Nsp15 is an endoribonuclease 28 that can prevent the host from detecting viral double-stranded RNA, allowing the virus to escape attack by the immune system. Nsp16 is a specific methyltransferase that catalyzes 2'-O-methylation of the first nucleotide of the capped viral RNA, protecting it from degradation by the host innate immune response. 29, 30 SARS-CoV-2 has four structural proteins: spike (S), membrane (M), envelope (E), and nucleocapsid (N). The S protein helps the virus enter host cells by binding to the human ACE2 receptor, [31] [32] [33] and induces an inflammatory response after being recognized by TLR4. 33 The N protein packages the viral RNA into a helical ribonucleocapsid and forms the viral nucleocapsid structure together with the M protein. 
[34] [35] [36] E may play an important role in virus maturation, transmission, and reproduction. 4 In addition, the virus has several accessory proteins, such as ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF9, and ORF10. 4 As a transmembrane protein, ORF3a is related to virion release and viral pathogenicity. 37, 38 ORF7a may play a role in protein transport mediated by the endoplasmic reticulum and Golgi complex. 4 Exogenous overexpression of ORF8 in cells can disrupt host IFN-I signaling. 39 ORF9b, encoded within the N gene, inhibits the host immune response. 40

SARS-CoV-2 mutates rapidly, and different classification methods have produced a large number of lineages or clades. As of February 12, 2020, Yu et al. 41 found that SARS-CoV-2 could be classified into five groups comprising 58 haplotypes, in which H13 and H35 were ancestral haplotypes and H1 (from the Hua Nan market) was derived from the H3 haplotype. Two months later, Peter et al. 42 50 found that many recurrent mutations occurred in S protein, Nsp6, Nsp11, and Nsp3, and that nonsynonymous mutations accounted for nearly 80%. Takahiko et al. 51 found, after aligning 10,022 SARS-CoV-2 genomes from February 1 to May 1, 2020, that the synonymous C3037T site in the genome, P4715L in ORF1ab, and D614G in S protein were the most frequent mutations. On May 7, 2020, Yujiro et al. 52 found that P4715L of ORF1ab and D614G of S protein are linked and associated with lethality. In July 2020, Domenico et al. 53 found two mutations, at Nsp6 position 37 and at ORF10 position 3 or 4, that reduce the structural stability of the two proteins. In July 2020, Phan 54 found three mutations (N354D, D364Y, and V367F) on the surface of S protein, which may change its conformation and thereby the antigenicity of the virus. On October 2, 2020, Sarmilah et al. 55 found that, for S protein, the N501Y mutation was more infectious than the D614G mutation. On December 18, 2020, Yixuan et al. 
56 found that, with the D614G mutation in S protein, the variant shows more efficient infection, replication, and competitive fitness in human primary airway epithelial cells.

In this study, we first use the combined sequence of all ORFs to analyze the phylogenetic relationships of the strains in the United States from January 19 to March 12, 2020. Then, to test our phylogenetic tree, we run a series of follow-up analyses: we calculate the nucleotide substitution rates of each ORF, analyze its codon usage pattern and population expansion, and align the corresponding protein sequences. Finally, we compare our phylogenetic results with those of others to reveal the evolutionary dynamics of SARS-CoV-2 in the early stage of the US epidemic outbreak, assess the characteristics of its classification, and understand the trend of the epidemic.

On May 12, 2020, 683 SARS-CoV-2 genomes from January 19 to March 12 in the United States, flagged as "complete (>29,000 bp)" and "high coverage," were downloaded from the GISAID Initiative EpiCoV platform. Removing every sequence containing N, W, or other ambiguous or missing sites produced a final data set of 265 genomes. In addition, two reference ancestral sequences, hCoV-19/bat/Yunnan/RaTG13/2013 and hCoV-19/Wuhan-Hu-1/2019, were downloaded from GISAID. Voucher information for all 267 strains can be found in Table S1. With the aid of the reference sequence hCoV-19/Wuhan-Hu-1/2019, ORFfinder (Linux x64) was used to identify the meaningful ORFs in each genome. Because the ORF9 gene is part of the nucleocapsid gene, and the sequence and function of the ORF14 gene cannot be found in NCBI or GISAID, 12 ORFs (genes) were retained from each genome. Following the order of the genes on the genome, the 12 ORF sequences were spliced end to end into one large assembly sequence, producing 267 assemblies. 
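The filtering and splicing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names (`is_unambiguous`, `filter_genomes`, `splice_orfs`), the toy sequences, and the three-gene order are hypothetical, and the ORF-finding step performed with ORFfinder is not reproduced here.

```python
# Minimal sketch: drop genomes with ambiguous bases, then splice ORFs end to end.
# All names and toy data below are hypothetical illustrations.

UNAMBIGUOUS = set("ACGT")


def is_unambiguous(seq: str) -> bool:
    """True if the sequence contains only A, C, G, T (no N, W, gaps, etc.)."""
    return set(seq.upper()) <= UNAMBIGUOUS


def filter_genomes(genomes: dict[str, str]) -> dict[str, str]:
    """Keep only genomes without ambiguous or missing sites."""
    return {name: seq for name, seq in genomes.items() if is_unambiguous(seq)}


def splice_orfs(orfs: dict[str, str], gene_order: list[str]) -> str:
    """Concatenate the ORFs of one genome in the given genomic order."""
    return "".join(orfs[gene] for gene in gene_order)


# Toy input: three fake genomes, one containing N and one containing W.
genomes = {
    "strain_a": "ATGGCTTAA",
    "strain_b": "ATGNCTTAA",
    "strain_c": "ATGWCTTAA",
}
kept = filter_genomes(genomes)

# Toy ORFs for one genome; the real study used 12 ORFs per genome.
toy_order = ["ORF1a", "S", "N"]
toy_orfs = {"S": "ATGTTT", "ORF1a": "ATGGAG", "N": "ATGTCT"}
assembly = splice_orfs(toy_orfs, toy_order)
```

Filtering on the strict alphabet `ACGT` removes any IUPAC ambiguity code at once, which matches the description of discarding sequences with N, W, and other missing sites.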
To characterize the evolution of the viral genes, DnaSP is used to analyze the mismatch distribution of each gene and of the assemblies of all SARS-CoV-2 strains in the United States. Mismatch distribution is a way to visually reflect the historical dynamics of a population. If the mismatch curve shows a unimodal Poisson distribution, it is generally accepted that the population has experienced expansion or continuous growth. On the other hand, if the curve coincides with the expectation curve, the population size has remained stable in the past. Twelve viral proteins of all strains are aligned to find the common and distinguishing characteristics of each clade in the phylogenetic tree. For the same purpose, we calculate Ka (nonsynonymous substitution rate), Ks (synonymous substitution rate) and to USA clade-6), among which Wuhan-Hu-1 belongs to USA clade-3, indicating that it is not appropriate as the reference ancestor sequence used in many papers. 42, 43, 49, 49, 50, 52 In the phylogenetic tree, USA clade-1 is the closest relative of the ancestor (RaTG13), followed by USA clade-2, USA clade-3, USA clade-4, USA clade-5, and USA clade-6. In addition, as the largest clade, USA clade-

FIGURE 1 Phylogenetic tree of 265 SARS-CoV-2 strains in the United States from January 19 to March 12, 2020, rooted with RaTG13 and compared with Wuhan-Hu-1. Yellow represents the ancestor RaTG13, red USA clade-1, green USA clade-2, orange USA clade-3, blue USA clade-4, pink USA clade-5, purple USA clade-6, and grey Wuhan-Hu-1. The details of each clade are shown in Figure S1.

FIGURE 2 Bayesian phylogenetic tree of 265 SARS-CoV-2 strains in the United States from January 19 to March 12, 2020, showing the divergence times of the six clades. Red represents USA clade-1, green USA clade-2, orange USA clade-3, blue USA clade-4, pale pink USA clade-5, and purple USA clade-6. The details of each clade are shown in Figure S2.

Mismatch distribution can be used to trace virus population expansion. 41 The mismatch distributions of the assemblies and of the individual genes are shown in Figures 3 and S4. Figure 3 shows that the mismatch distribution curve of the assemblies has multiple peaks, suggesting that the assemblies experienced several dominant mutations during SARS-CoV-2 evolution, which may have helped the virus adapt better to the environment at that time. Taken together, panels A to D of Figure S4 show that the mismatch distribution curves of ORF1a, ORF1b, S, and ORF3a follow a unimodal Poisson pattern, which means that these four genes experienced a rapid expansion of variation in the past and that their mutants have been increasing significantly. From the protein analysis in Tables 1 and S2, the ORF1a mutation T265I may play an important role in the evolution from USA clade-5 to USA clade-6, as may the ORF1b mutation P4715L and the S mutation D614G from USA clade-3 to USA clade-4, and the ORF3a mutation Q57H from USA clade-4 to USA clade-5. Given the functions of these proteins, we speculate that the four mutated proteins may help SARS-CoV-2 bind to the host more easily and proliferate more efficiently. In contrast, panels E to K of Figure S4 show that the mismatch distribution curves of the other genes are similar to their expectation curves, indicating that these genes are relatively conserved in the evolution of SARS-CoV-2.

Note: Bold font represents the mutation sites of the six USA clades. Capital letters represent amino acids. The numbers after clades and amino acids represent the number of virus strains. "Position" means the site of an amino acid in the gene shown in the preceding row. The details can be found in Table S2. 
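For readers unfamiliar with mismatch analysis, the observed mismatch distribution is simply the histogram of nucleotide differences over all pairs of aligned sequences. The sketch below computes only that observed histogram on toy data; the function names and sequences are hypothetical, and it does not reproduce the expected curves or the statistics that DnaSP reports.

```python
# Minimal sketch of an observed mismatch distribution (toy data, hypothetical names).
from collections import Counter
from itertools import combinations


def pairwise_differences(a: str, b: str) -> int:
    """Number of sites at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_distribution(seqs: list[str]) -> dict[int, int]:
    """Map each pairwise-difference count to the number of sequence pairs showing it."""
    counts = Counter(pairwise_differences(a, b) for a, b in combinations(seqs, 2))
    return dict(sorted(counts.items()))


# Four toy aligned sequences give six pairs in total.
toy_alignment = ["AAAA", "AAAT", "AATT", "AAAA"]
distribution = mismatch_distribution(toy_alignment)
for k, n_pairs in distribution.items():
    print(f"{k} differences: {n_pairs} pairs")
```

Plotting the counts against the number of differences gives the curve whose shape (single peak versus several peaks) is interpreted in the text.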
The synonymous substitution rate (Ks), the nonsynonymous substitution rate (Ka), and their ratio ω (ω = Ka/Ks) can be used to infer the evolutionary pressure on SARS-CoV-2 genes. According to the theory of Michael et al., 57 Ks < 0.05 may indicate that genes experience a phase of accelerated protein evolution early in their evolutionary history, followed by a period of gradually increasing selective constraint, with a progressive decline in ω.

FIGURE 4 Ks, synonymous nucleotide substitution rate (A); Ka, nonsynonymous substitution rate (B); and ω, the ratio of the nonsynonymous to the synonymous substitution rate (C), for each gene and the assemblies in the different clades

Ks, Ka, and ω of the viral genes in each clade are shown in Figure 4 and Table 2. Ks of every gene is far below 0.05 in all clades, indicating that the viral genes are in an early stage of evolution. For example, Ks of ORF8 is zero in all clades and Ka is also zero in five clades, suggesting that the ORF8 gene may be in the earliest period of the evolutionary process. The same holds for other genes, such as ORF3a, ORF7a, and M in most clades, and S in clade-4 to clade-6. In this period, SARS-CoV-2 is experiencing accelerated evolution of some proteins in several clades, such as S in clade-4 to clade-6 (Ks = 0, Ka > 0). In addition, for each viral gene, the value of Ks is steady among the clades except for ORF1b in clade-2 and clade-3, partly supporting the assumption that synonymous substitutions are largely immune from selection and accumulate at a stochastic rate proportional to time. Similarly, the value of Ka is steady among all clades except for ORF1a in clade-2 and clade-3. Apart from the genes with Ks = 0 or Ka = 0, the value of ω is smaller than 1, meaning that most SARS-CoV-2 genes were mainly affected by purifying selection. 
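The ratio and its reading can be made concrete with a small sketch. The rates used below are hypothetical illustration values, not entries from Table 2, and the function names are assumptions; the only logic taken from the text is ω = Ka/Ks, the Ks < 0.05 threshold, and the special handling of Ks = 0, where the ratio is undefined.

```python
# Minimal sketch: compute omega = Ka/Ks and label the selection regime.
# Function names and the numeric inputs are hypothetical illustrations.
from typing import Optional


def omega(ka: float, ks: float) -> Optional[float]:
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


def describe_selection(ka: float, ks: float) -> str:
    """Conventional reading of omega: <1 purifying, =1 neutral, >1 positive."""
    w = omega(ka, ks)
    if w is None:
        return "undefined (Ks = 0)"
    if w < 1:
        return "purifying selection"
    if w > 1:
        return "positive selection"
    return "neutral evolution"


def is_early_stage(ks: float, threshold: float = 0.05) -> bool:
    """Ks below the threshold cited in the text suggests an early evolutionary stage."""
    return ks < threshold


# Hypothetical rates for three made-up genes: (Ka, Ks).
examples = {"geneA": (0.0002, 0.001), "geneB": (0.0003, 0.0), "geneC": (0.0, 0.0)}
for gene, (ka, ks) in examples.items():
    print(gene, describe_selection(ka, ks), "early stage:", is_early_stage(ks))
```

Returning `None` for Ks = 0 keeps cases such as Ks = 0, Ka > 0 separate from genuine ω values, mirroring the way the text excludes genes with Ks = 0 or Ka = 0 before comparing ω with 1.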
Figure 4 shows that the values of ω are higher for ORF1b in USA clade-2 and USA clade-3 and for N in USA clade-4; because a larger ω that is still below 1 corresponds to weaker constraint, this indicates that purifying selection on these two genes is comparatively relaxed in those clades. For ORF1a, ω progressively declines from USA clade-2 to USA clade-6, which may mean that the strain gradually adapts to the environment over the course of evolution. PCA is commonly used to analyze potential evolutionary trends in codon usage patterns. The PCA plot of the 265 assemblies is shown in Figure 5, from which it is seen that USA clade-1 is distributed in the first quadrant, USA clade-2 is distributed near the horizontal positive (Table S3 and Figure 6). The study of BII/GIS from A*STAR Singapore, as of January 22, 2021, 47 shows that clade S is the ancestor, from which clades L and V diverged, followed by clade G, clade GH, clade GR, clade GV, and so on. Yu et al. 41 also found that H13 and H35 (belonging to clade S) were ancestral haplotypes, which is similar to our conclusion. Wuhan 52 Similarly, during the evolution from the first four clades to the last two clades, the Q57H mutation in ORF3a protein may enhance virion release and viral pathogenicity. 37, 38 During the evolution from the first two clades to the last four clades, the S84L mutation in ORF8 protein may help the virus escape the immune response. 39 During the evolution from the other clades to USA clade-6, the T265I mutation in ORF1a protein may help SARS-CoV-2 form an advantaged group. Moreover, for USA clade-4, the newly arisen strong linkage of R203K and G204R in N protein may help the virus form the helical ribonucleocapsid. 36

All in all, we not only describe the evolutionary process across six clades of SARS-CoV-2 in the early stage of the epidemic in the United States, but also identify their evolutionary direction and characteristics, which supplements the strain classification and helps to infer the current evolutionary trend of SARS-CoV-2. 
In addition, we identify many particular mutations in viral proteins of the different clades, which lays a foundation for studying the function of mutated viral proteins.

1. Properties of coronavirus and SARS-CoV-2
2. Treatment for COVID-19: an overview
3. A new coronavirus associated with human respiratory disease in China
4. Structural insights into SARS-CoV-2 proteins
5. The human coronavirus disease COVID-19: its origin, characteristics, and insights into potential drugs and its mechanisms
6. The COVID-19 pandemic: a comprehensive review of taxonomy, genetics, epidemiology, diagnosis, treatment, and control
7. Emerging coronaviruses: genome structure, replication, and pathogenesis
8. A two-pronged strategy to suppress host protein synthesis by SARS coronavirus Nsp1 protein
9. Identification of severe acute respiratory syndrome coronavirus replicase products and characterization of papain-like protease activity
10. The papain-like protease of severe acute respiratory syndrome coronavirus has deubiquitinating activity
11. Deubiquitinating activity of the SARS-CoV papain-like protease
12. The papain-like protease from the severe acute respiratory syndrome coronavirus is a deubiquitinating enzyme
13. Selectivity in ISG15 and ubiquitin recognition by the SARS coronavirus papain-like protease
14. Molecular basis for ADP-ribose binding to the Mac1 domain of SARS-CoV-2 nsp3
15. Crystal structures of SARS-CoV-2 ADP-ribose phosphatase: from the apo form to ligand complexes
16. The SARS-CoV-2 conserved macrodomain is a highly efficient ADP-ribosylhydrolase
17. Severe acute respiratory syndrome coronavirus nonstructural proteins 3, 4, and 6 induce double-membrane vesicles
18. Virus-encoded proteinases and proteolytic processing in the Nidovirales
19. Structure of the RNA-dependent RNA polymerase from COVID-19 virus
20. Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir
21. Severe acute respiratory syndrome coronavirus nsp9 dimerization is essential for efficient viral growth
22. A structural view of SARS-CoV-2 RNA replication machinery: RNA synthesis, proofreading and final capping
23. Structural basis for helicase-polymerase coupling in the SARS-CoV-2 replication-transcription complex
24. Delicate structural coordination of the severe acute respiratory syndrome coronavirus Nsp13 upon ATP hydrolysis
25. Structural basis and functional analysis of the SARS coronavirus nsp14-nsp10 complex
26. Nidovirales: evolving the largest RNA virus genome
27. Viruses know more than one way to don a cap
28. Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2
29. In vitro reconstitution of SARS-coronavirus mRNA cap methylation
30. Crystal structure and functional analysis of the SARS-coronavirus RNA cap 2'-O-methyltransferase nsp10/nsp16 complex
31. Structural and functional basis of SARS-CoV-2 entry by using human ACE2
32. A pneumonia outbreak associated with a new coronavirus of probable bat origin
33. In silico studies on the comparative characterization of the interactions of SARS-CoV-2 spike glycoprotein with ACE-2 receptor homologs and human TLRs
34. Transient oligomerization of the SARS-CoV N protein-implication for virus ribonucleoprotein packaging
35. Characterization of protein-protein interactions between the nucleocapsid protein and membrane protein of the SARS coronavirus
36. Severe acute respiratory syndrome-associated coronavirus nucleocapsid protein interacts with smad3 and modulates transforming growth factor-β signaling
37. Role of severe acute respiratory syndrome coronavirus viroporins E, 3a, and 8a in replication and pathogenesis. mBio
38. The ORF3a protein of SARS-CoV-2 induces apoptosis in cells
39. The ORF6, ORF8 and nucleocapsid proteins of SARS-CoV-2 inhibit type I interferon signaling pathway
40. SARS-CoV-2 Orf9b suppresses type I interferon responses by targeting TOM70
41. Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2) using whole genomic data
42. Phylogenetic network analysis of SARS-CoV-2 genomes
43. Evidence of increasing diversification of emerging SARS-CoV-2 strains
44. Early transmission of SARS-CoV-2 in South Africa: an epidemiological and phylogenetic report
45. Arya PC. Evolution and spread of SARS-CoV-2 likely to be affected by climate
46. Sixteen novel lineages of SARS-CoV-2 in South Africa
47. Phylogenetic tree and clades of global area as of January 22, 2021 acquired in platform GISAID
48. The establishment of reference sequence for SARS-CoV-2 and variation analysis
49. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
50. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
51. Variant analysis of SARS-CoV-2 genomes
52. SARS-CoV-2 genomic variations associated with mortality rate of COVID-19
53. Evolutionary analysis of SARS-CoV-2: how mutation of non-structural protein 6 (Nsp6) could affect viral autophagy
54. Genetic diversity and evolution of SARS-CoV-2
55. Evaluation of the effect of D614G, N501Y and S477N mutation in SARS-CoV-2 through computational approach
56. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo
57. The evolutionary fate and consequences of duplicate genes

The data that support the findings of this study are available in the supplementary material of this article.

http://orcid.org/0000-0001-5078-6492