key: cord-326666-melz5fq4
authors: Sun, Weitao
title: The discovery of gene mutations making SARS-CoV-2 well adapted for humans: host-genome similarity analysis of 2594 genomes from China, the USA and Europe
date: 2020-09-03
journal: bioRxiv
DOI: 10.1101/2020.09.03.280727
sha:
doc_id: 326666
cord_uid: melz5fq4

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a positive-sense single-stranded RNA virus approximately 30 kb in length, causes the ongoing novel coronavirus disease-2019 (COVID-19). Studies have confirmed significant genome differences between SARS-CoV-2 and SARS-CoV, suggesting that the distinctions in pathogenicity might be related to genomic diversity. However, the relationship between genomic differences and SARS-CoV-2 fitness has not been fully explained, especially for open reading frame (ORF)-encoded accessory proteins. RNA viruses have a high mutation rate, but how SARS-CoV-2 mutations accelerate adaptation is not clear. This study shows that the host-genome similarity (HGS) of SARS-CoV-2 is significantly higher than that of SARS-CoV, especially in the ORF6 and ORF8 genes, which encode proteins antagonizing innate immunity in vivo. A power law relationship was discovered between the HGS of ORF3b, ORF6, and N and the expression of interferon (IFN)-sensitive response element (ISRE)-containing promoters. This finding implies that the high HGS of the SARS-CoV-2 genome may further inhibit IFN I synthesis and cause delayed host innate immunity. An ORF1ab mutation, 10818G>T, which occurred in virus populations with high HGS but rarely in low-HGS populations, was identified in 2594 genomes with geolocations of China, the USA and Europe. The 10818G>T mutation caused the amino acid mutation M37F in the transmembrane protein nsp6. The results suggest that the ORF6 and ORF8 genes and the mutation M37F may play important roles in causing COVID-19.
The findings demonstrate that HGS analysis is a promising way to identify important genes and mutations in adaptive strains, which may help in the search for potential targets for pharmaceutical agents.

In December 2019, a novel coronavirus, SARS-CoV-2, was reported as the cause of COVID-19. SARS-CoV-2 has a positive-sense single-stranded RNA genome approximately 30 kb in length [1]. Studies have shown that considerable genetic diversity exists between SARS-CoV-2 and SARS-CoV [1, 2]. Compared with SARS- [...]

[...] speculated that some genes of the two viruses may also exist in the human genome or that the viruses may [...]

[...] may accelerate adaptation in humans through increasing HGS of the ORF6 and ORF8 genes and selecting the M37F mutation. However, the underlying mechanism by which these genes and mutations make SARS-CoV-2 more adapted to humans remains unclear.

[...] (2)

Here, the E value represents the expected number of times that two random sequences of lengths m and n are matched with a score not lower than S'. Parameters K and λ describe the statistical significance of the results [18]. Assuming that the fragment of length [...] matches perfectly in the two random sequences, one has the following formula:

[...] (3)

Since the viral genome is quite different from the human genome, matching fragments are usually very short. When [...] is particularly small compared to [...] and [...], [...] is obtained by combining Equation (3) [...]

The SARS-CoV-2 (GenBank: MN908947.3) and SARS-CoV (GenBank: AY394850.2) RNA sequences were used as references to establish the genome organization. SARS-CoV-2 has 14 5'-ORFs, while SARS-CoV has 19 5'-ORFs. The length of each ORF is no less than 75 nt (Table 1). A quantitative definition of HGS was proposed to investigate the similarity between viral coding sequences (CDSs) and the human genome (Homo sapiens GRCh38.p12 chromosomes).
The CDS alignment scores were determined by using NCBI Blastn [17], and HGS was calculated by the formulas described in the Methods for each ORF in the coronavirus genome. The overall HGS of a full-length virus genome was obtained as the weighted sum of the ORF HGSs. The weighting factor was the ratio of ORF length to the full-genome length. The ORF lengths of the SARS-CoV and SARS-CoV-2 genomes are given in Table 1.

Table 1 (partial). ORF positions and lengths in SARS-CoV and SARS-CoV-2. The first three columns are ORF names; each virus has start, end, length (nt) and length (aa) columns.

ORF names              SARS-CoV                        SARS-CoV-2
                       start   end     nt     aa       start   end     nt     aa
[...]  [...]     R     13685   13759   75     24       13685   13759   75     24
ORF1b  ORF1b     1b    13398   21485   8088   2628     13768   21555   7788   2595
S      Sprotein  S     21492   25259   3768   1282     21536   25384   3849   1282
N/R    N/R       N/R   25207   25329   123    40       25332   25448   117    38
ORF3a  ORF3      X1    25268   26092   825    274      25393   26220   828    275
E      Eprotein  E     26117   26347   231    76       26245   26472   228    75
M      Mprotein  M     26398   27063   666    221      26523   27191   669    222
ORF6   ORF7      X3    27074   27265   192    63       27202   27387   186    61
ORF7a  ORF8      X4    27273   27641   369    122      27394   27759   [...]

[...] decreased rapidly with increasing HGS (Fig 6), which provided evidence that there was a power law [...]

Of all the gene mutations, the ORF1ab 10818G>T (TTG>TTT) mutation is the most interesting. This mutation survived in all three regions (Fig 10). In addition, this mutation occurred only in the high-HGS population rather than in the population with a lower HGS ( [...]

This 10818G>T ORF1ab mutation caused an amino acid mutation, M37F, in the nonstructural protein nsp6, which is located in a loop between the first and second transmembrane domains on the N-terminal side (Fig 11). This finding strongly suggested that the 10818G>T (M37F) mutation survived a [...]

The discovery of increased HGS of ORF6 and ORF8 provides strong evidence that SARS-CoV-2 evolved to be better adapted to humans than SARS-CoV. Based on these findings, the following conjecture is proposed: the SARS-CoV-2 genes involved in suppressing the host's innate immunity are more powerful.
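The position of the 10818G>T mutation discussed above can be checked with a little coordinate arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated in the text: that 10818 is counted from the first nucleotide of ORF1ab, and that ORF1ab starts at genome position 266 and the nsp6 mature peptide at 10973, as in the NCBI annotation of the reference genome. The function names are hypothetical.

```python
# Assumed 1-based genome coordinates from the NCBI reference annotation
# (not given in the paper itself).
ORF1AB_START = 266    # first nucleotide of ORF1ab
NSP6_START = 10973    # first nucleotide of the nsp6 mature peptide


def orf_to_genome(orf_pos, orf_start):
    """Convert a 1-based position within an ORF to a 1-based genome position."""
    return orf_start + orf_pos - 1


def residue_and_codon_position(genome_pos, peptide_start):
    """Return (1-based residue number, 1-based position within the codon)."""
    offset = genome_pos - peptide_start  # 0-based offset into the coding sequence
    return offset // 3 + 1, offset % 3 + 1


genome_pos = orf_to_genome(10818, ORF1AB_START)
residue, codon_pos = residue_and_codon_position(genome_pos, NSP6_START)
print(genome_pos, residue, codon_pos)  # prints: 11083 37 3
```

Under these assumptions the substitution falls on the third base of codon 37 of nsp6, which agrees with the reported codon change TTG>TTT and with residue 37 of nsp6.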
A new coronavirus associated with human respiratory disease in China
Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission. SCIENCE CHINA Life Sciences
Temporal dynamics in viral shedding and transmissibility of COVID-19
Complete nucleotide sequence of SV40 DNA
The nucleotide sequence of repetitive monkey DNA found in defective simian virus 40
Genome sequence of a human tumorigenic poxvirus: prediction of specific host response-evasion genes
Preliminary study on genome homology of viruses and human
The evolution of large DNA viruses: combining genomic information of viruses and their hosts
Immunity and immunopathology to viruses: what decides the outcome
Viral evasion of antigen presentation: not just for peptides anymore
SARS-Coronavirus Open Reading Frame-8b triggers intracellular stress pathways and activates NLRP3 inflammasomes
[...] Syndrome (SARS) Coronavirus ORF8 Protein Is Acquired from SARS-Related Coronavirus from Greater Horseshoe Bats through Recombination
Expression, post-translational modification and biochemical characterization of proteins encoded by subgenomic mRNA8 of the severe acute respiratory syndrome coronavirus
Dysregulated Type I Interferon and Inflammatory Monocyte-Macrophage Responses Cause Lethal Pneumonia in SARS-CoV-Infected Mice
The establishment of reference sequence for SARS-CoV-2 and variation analysis
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
The single-cell RNA-seq data analysis on the receptor ACE2 expression reveals the potential risk of different human organs vulnerable to Wuhan 2019-nCoV infection. Frontiers of [...]
Specific ACE2 Expression in Cholangiocytes May Cause Liver Damage After 2019-nCoV Infection
Priapism in a patient with coronavirus disease 2019 (COVID-19): A case report
Clinical characteristics of 2019 novel coronavirus infection in China
SARS coronavirus spike protein-induced innate immune response occurs via activation of the NF-κB pathway in human monocyte macrophages in vitro
Human Coronavirus: Host-Pathogen Interaction. Annual review of microbiology
Innate Immune Evasion by Human Respiratory RNA Viruses
A molecular arms race between host innate antiviral response and emerging human coronaviruses
Immune evasion of porcine enteric coronaviruses and viral modulation of antiviral innate signaling
Saitoh T, Akira S. Regulation of innate immune responses by autophagy-related proteins
Human Coronaviruses: A Review of Virus-Host Interactions
Viral Innate Immune Evasion and the Pathogenesis of Emerging RNA [...]
Evasion of Host Innate Immunity by Emerging Viruses: Antagonizing Host RIG-I [...]

[...] shown with special markers at the top of colored blocks representing ORFs. Mutation 10818G>T in ORF1ab (codon TTG>TTT) occurred in populations with high HGS, which results in amino acid M37F mutation in transmembrane protein nsp6. The mutation rarely [...]

Mutation profile for SARS-CoV-2 genomes (geolocation of Europe) with different HGS. Out of a total of 856 viral genomes, 98 genomes have unique HGS values. A total of 145 mutations were identified in all the genomes. The top 7 conserved mutations with [...] were shown with special markers at the top of colored blocks representing ORFs. Mutation 10818G>T in ORF1ab (codon TTG>TTT) occurred in populations with high HGS, which results in amino acid M37F mutation in transmembrane protein nsp6. The mutation rarely occurred in populations with low/moderate HGS.

Highly conserved mutations identified in SARS-CoV-2 genomes with geolocations of [...] The three regions have different sets of mutations. The TTT (F, Phenylalanine) mutation occurred in all three regions. TTT represents the mutation 10818G>T (TTG>TTT) in ORF1ab.
The F in the circle represents the amino acid mutation (Methionine to Phenylalanine) in nonstructural protein nsp6. The P, H, +, - and S in brackets in the legend represent polar, hydrophobic, positively charged, negatively charged and special residues.

The topology of transmembrane protein nsp6 and the identified M37F mutation located in a loop between the first and second [...]

The accession number and corresponding HGS of 200 SARS-CoV-2 genomes with geolocation of China. Filename is DatasetS1_China_SARS-CoV-2_nstrain200_ORFHGS_allinone.xls. The file contains accession ID, collection date, location, HGS values for 10 ORFs [...]

The accession number and corresponding HGS of 1538 SARS-CoV-2 genomes with geolocation of the USA. Filename is DatasetS2_USA_SARS-CoV-2_nstrain1538_ORFHGS_allinone.xls. The file contains accession ID, collection date, location, HGS values for 10 ORFs [...]

The accession number and corresponding HGS of 856 SARS-CoV-2 genomes with geolocation of Europe. Filename is DatasetS3_Europe_SARS-CoV-2_nstrain856_ORFHGS_allinone.xls. The file contains accession ID, collection date, location, HGS values for 10 ORFs [...]

The accession number and corresponding HGS of 25 SARS-CoV genomes. Filename is DatasetS4_SARS-CoV_nstrain25_ORFHGS_allinone.xls. The file contains accession ID, HGS values for 10 ORFs and the weighted HGS of the whole genome.
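The whole-genome weighted HGS listed in the datasets is described in the Results as a weighted sum of the ORF HGS values, with each weight equal to the ratio of ORF length to full-genome length. A minimal sketch of that calculation follows. The function name is hypothetical, the ORF lengths are the SARS-CoV-2 values from Table 1, the genome length of 29903 nt is assumed from the GenBank record MN908947.3, and the per-ORF HGS numbers are placeholders chosen for illustration, not results from this study.

```python
def weighted_hgs(orf_hgs, orf_len, genome_len):
    """Length-weighted sum of per-ORF HGS values.

    orf_hgs    -- dict mapping ORF name to its HGS value
    orf_len    -- dict mapping ORF name to its length in nt
    genome_len -- full genome length in nt
    """
    missing = set(orf_hgs) - set(orf_len)
    if missing:
        raise KeyError(f"no length given for ORFs: {sorted(missing)}")
    # Each ORF contributes its HGS scaled by (ORF length / genome length).
    return sum(orf_hgs[name] * orf_len[name] / genome_len for name in orf_hgs)


# SARS-CoV-2 ORF lengths (nt) taken from Table 1.
lengths = {"S": 3849, "E": 228, "M": 669, "ORF6": 186}

# Placeholder HGS values for illustration only (NOT values from the paper).
hgs = {"S": 0.60, "E": 0.55, "M": 0.58, "ORF6": 0.70}

# Assumed genome length of the reference sequence MN908947.3.
GENOME_LEN = 29903

print(weighted_hgs(hgs, lengths, GENOME_LEN))
```

Because the weights are length fractions of the full genome, they sum to less than one whenever only a subset of ORFs is included, so values computed from different ORF subsets are not directly comparable.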