key: cord-0701458-fi7fy523
authors: Nagy, A.; Pongor, S.; Gyorffy, B.
title: Different mutations in SARS-CoV-2 associate with severe and mild outcome
date: 2020-10-20
journal: nan
DOI: 10.1101/2020.10.16.20213710
sha: 2e212dbb289e9802f3169018381ff1d476ede456
doc_id: 701458
cord_uid: fi7fy523

Introduction. Genomic alterations in a viral genome can lead to either a better or a worse outcome, and identifying these mutations is of utmost importance. Here, we correlated protein-level mutations in the SARS-CoV-2 virus to clinical outcome. Methods. Mutations in viral sequences from the GISAID virus repository were evaluated using hCoV-19/Wuhan/WIV04/2019 as the reference. Patient outcomes were classified as mild disease, hospitalization and severe disease (death or documented treatment in an intensive-care unit). A chi-square test was applied to examine the association between each mutation and patient outcome. The false discovery rate was computed to correct for multiple hypothesis testing, and results passing an FDR cutoff of 5% were accepted as significant. Results. Mutations were mapped to amino acid changes for 2,120 non-silent mutations. Mutations correlated to mild outcome were located in ORF8, NSP6, ORF3a, NSP4, and the nucleocapsid phosphoprotein N. Mutations associated with inferior outcome were located in the surface (S) glycoprotein, the RNA-dependent RNA polymerase, the 3'-to-5' exonuclease, ORF3a, NSP2 and N. Mutations leading to severe outcome with low prevalence were found in the surface (S) glycoprotein and in NSP7. Five of the 17 most significant mutations mapped onto a 10 amino acid long phosphorylated stretch of N, indicating that, in spite of obvious sampling restrictions, the approach can find functionally relevant sites in the viral genome. Conclusions. We demonstrate that mutations in the viral genes may have a direct correlation to clinical outcome.
Our results help to quickly identify SARS-CoV-2 infections harboring mutations related to severe outcome.

There are seven human coronaviruses: MERS, Human-HKU-1, Human NL63, Human 229E, Human OC43, SARS-CoV, and SARS-CoV-2. The natural host of this latest RNA virus is the Chinese rufous horseshoe bat (Rhinolophus sinicus), and the transfer to humans initiated the ongoing COVID-19 outbreak at the end of 2019 [1]. The mortality rate of SARS-CoV-2 in the overall population is low [2], and the infections have at most a limited impact on total respiratory-virus-associated mortality [3]. However, when the virus strikes a critically ill patient, mortality can reach 26% [4].

The linear genome of the SARS-CoV-2 virus has 29,903 bases and harbors 25 genes [5]; the reference sequence is accessible in GenBank under the accession number MN908947. Phylogenetic analysis of SARS-CoV-2 genomes shows three variants, termed A, B and C, which are distributed differently among sequences from Asia, Europe and the Americas [6]. The viral genes include, among others, an envelope protein, an RNA-dependent RNA polymerase, a surface glycoprotein, an exonuclease, a methyltransferase, and 11 nonstructural proteins. Some of these are within the virus particle, but others, including the spike glycoprotein, the membrane glycoprotein, and the envelope protein, are on the viral surface. In theory, any functional or structural viral gene can have an effect on the efficiency of a virus and both mutations [7] or alteration in the expression [8] [...] between February and July 2020 [13].

In this context, the most important task is to identify viral mutations leading to different patient outcomes. Mutations resulting in a mild disease could facilitate the spread of the virus and thereby maintain the outbreak. Other mutations leading to a more severe disease need immediate attention to prevent detrimental outcomes. Here, our goal was to identify and rank mutations associated with altered patient outcome by simultaneously correlating outcomes to all mutations across a large cohort of patients.

All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted October 20, 2020. https://doi.org/10.1101/2020.10.16.20213710 (medRxiv preprint)

All available SARS-CoV-2 (taxid: 2697049) viral nucleic acid sequences were downloaded from the GISAID virus repository (https://www.gisaid.org/) in FASTA format. Only those viral sequences were selected for which the entire viral nucleic acid sequence was published. A second filtering step retained only virus genomes with available patient follow-up status.

The mutations were evaluated using CoVsurver (https://corona.bii.a-star.edu.sg). The viral sequences in FASTA format were used as the query and "hCoV-19/Wuhan/WIV04/2019" was used as the reference. The analysis was run in batches of 1,000 samples. Protein mutations do not overlap, and the genomic boundaries of the various proteins in the WIV04 reference genome are displayed in Table 1.

As the patient samples were annotated with altogether more than sixty different outcome classifications, we had to coerce these into three major categories. Patients who were "asymptomatic", were "not hospitalized", had a "mild" disease, or were at "home" were all assigned to the "mild" category. Patients who were treated at outpatient departments, were quarantined, or were treated by the physician network were also classified as "mild". Patients who definitely needed medical care were assigned to the "hospitalized" group. These include those annotated as "hospitalized", "inpatient", "discharged", "released", and "recovered".
In addition, combinations of annotations that included any of these were also assigned to this cohort (e.g. "initially hospitalized" or "to be hospitalized"). Finally, patients with a detrimental outcome were allocated to the "severe" cohort. These include those "deceased", those with a "severe" disease, and those who entered "intensive care units". Any combination of these with other annotations (e.g. "hospitalized / ICU") was also added to this category.

All data processing and statistical analysis steps were performed in the R statistical environment v 3.6.3. Data processing was performed on 28 July 2020. A chi-square test was applied to examine the association between each mutation and patient status data. The false discovery rate, computed using the Benjamini-Hochberg method, was used to correct for multiple hypothesis testing, and only results passing an FDR cutoff of 5% were accepted as significant.

Altogether 73,020 SARS-CoV-2 viral nucleic acid sequences were available, and 72,331 of these included the entire viral nucleic acid sequence. Clinical data were available for 5,094 patients, and 3,184 of these also had follow-up data. This is a small fraction of the total data, which implies that our findings could contain a sampling bias. Regarding the clinical parameters of these patients, 55.6% were male and 38.2% were female (the remaining samples did not have this information). The geographical origin of the samples covers the entire globe: 4.8% were from Africa, 45.4% from Asia, 9.4% from Central America, 26.7% from Europe, 6.4% from North America and 6.6% from South America.
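The coercion of free-text outcome annotations into three categories, as described in the Methods, can be illustrated with a minimal Python sketch. The keywords are the annotations quoted in the text; the function name `classify_outcome`, the keyword matching by regular expression, and the precedence order (severe first, then an explicit "not hospitalized", then hospitalized, then mild) are assumptions made for illustration and are not taken from the authors' R code.

```python
import re
from typing import Optional

# Keywords quoted in the text. How the authors matched them is not stated,
# so the patterns and their precedence below are assumptions.
SEVERE = re.compile(r"deceased|severe|intensive care|\bicu\b")
HOSPITALIZED = re.compile(r"hospitalized|inpatient|discharged|released|recovered")
MILD = re.compile(r"asymptomatic|mild|home|outpatient|quarantine|physician network")


def classify_outcome(annotation: str) -> Optional[str]:
    """Map a free-text outcome annotation to 'mild', 'hospitalized' or 'severe'.

    Returns None when no keyword matches (sample without usable follow-up).
    """
    text = annotation.strip().lower()
    # Any severe keyword wins, so combinations such as "hospitalized / ICU"
    # end up in the severe cohort.
    if SEVERE.search(text):
        return "severe"
    # "not hospitalized" must be tested before the plain "hospitalized" keyword.
    if "not hospitalized" in text:
        return "mild"
    if HOSPITALIZED.search(text):
        return "hospitalized"
    if MILD.search(text):
        return "mild"
    return None


if __name__ == "__main__":
    for label in ("Hospitalized / ICU", "initially hospitalized", "Not hospitalized", "Home"):
        print(label, "->", classify_outcome(label))
```

Testing the severe keywords first reflects the statement that any combination containing a severe annotation was assigned to the severe cohort.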
Samples were collected between 30 December 2019 and 4 July 2020. Of all patients with follow-up, 625 had a mild disease, 2,341 had to be hospitalized and 218 had a severe disease.

Altogether 2,121 different mutations affecting the protein structure were identified, and 463 of these mutations were not present in samples with clinical follow-up. Across all mutations, we identified on average 2.81 mutations in each sample. As an internal control for potential bias in mutation prevalence related to patient proportions, we computed the average sample numbers for each clinical outcome cohort, and these closely matched the actual cohort sizes (wild type and mild outcome mean = 623, wild type and hospitalized mean = 2,336, wild type and severe outcome mean = 217).

When analyzing the correlation to clinical outcome across all mutations, 141 mutations reached statistical significance at FDR < 5%. The complete list of these mutations with sample numbers in each cohort is displayed in Supplemental Table 1, and patient-level mutation data are provided in Supplemental Table 2.

In order to concentrate on mutations with clinical relevance, we selected only those mutations which were present in at least 2% of the samples (this corresponds to a cutoff of at least 64 patient samples with a mutation). Among mutations related to mild outcome, only six passed all filtering criteria: L84S in the ORF8 protein, L37F in the NSP6 protein, G196V in the ORF3a protein, F308Y in the NSP4 protein, and the S197L and S202N mutations in the nucleocapsid phosphoprotein.
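The significance testing and prevalence filter described above can be sketched in Python using only the standard library. The cohort sizes (625, 2,341 and 218) and the thresholds (FDR 5%, prevalence 2%) come from the text. Everything else is an assumption for illustration: the authors worked in R, the 2x3 table layout (mutated versus wild type across three outcomes) is our reading of the Methods, and the mutation names and counts in `EXAMPLE_COUNTS` are hypothetical.

```python
import math

# Cohort sizes reported in the text (samples with follow-up data).
COHORT_SIZES = {"mild": 625, "hospitalized": 2341, "severe": 218}

# Hypothetical per-mutation counts of mutated samples, for illustration only.
EXAMPLE_COUNTS = {
    "mut_A": {"mild": 300, "hospitalized": 100, "severe": 5},
    "mut_B": {"mild": 20, "hospitalized": 75, "severe": 7},
    "mut_C": {"mild": 0, "hospitalized": 2, "severe": 20},
}


def chi_square_p(mutated):
    """Pearson chi-square p-value for an assumed 2x3 table (mutated/wild type x outcome)."""
    total = sum(COHORT_SIZES.values())
    n_mut = sum(mutated[o] for o in COHORT_SIZES)
    n_wt = total - n_mut
    if n_mut == 0 or n_wt == 0:
        return 1.0  # no variation between rows, nothing to test
    stat = 0.0
    for outcome, col in COHORT_SIZES.items():
        for observed, row in ((mutated[outcome], n_mut), (col - mutated[outcome], n_wt)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    # A 2x3 table has 2 degrees of freedom, where the survival function is exp(-x/2).
    return math.exp(-stat / 2.0)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_mutations(counts, fdr=0.05, min_prevalence=0.02):
    """Keep mutations that pass both the FDR cutoff and the prevalence filter."""
    total = sum(COHORT_SIZES.values())
    min_count = math.ceil(min_prevalence * total)  # 2% of 3,184 samples -> 64
    names = list(counts)
    q_values = benjamini_hochberg([chi_square_p(counts[n]) for n in names])
    return [n for n, q in zip(names, q_values)
            if q < fdr and sum(counts[n].values()) >= min_count]


if __name__ == "__main__":
    print(select_mutations(EXAMPLE_COUNTS))
```

In this hypothetical input, `mut_B` follows the cohort proportions and is not significant, while `mut_C` is strongly associated with severe outcome but occurs in only 22 samples and therefore fails the 2% prevalence filter.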
The complete list of mutations related to mild outcome, with their distribution among patient samples, is provided in Table 2.

When searching for mutations related to hospitalization or to severe outcome, we applied the same filter of including only mutations present in at least 2% of the samples. Altogether nine mutations passed these criteria. These originated in six genes: D614G and L54F in the surface (S) glycoprotein, P323L in the RNA-dependent RNA polymerase, Q57H in the ORF3a protein, S194L, R203K, and G204R in the nucleocapsid phosphoprotein, L177F in the 3'-to-5' exonuclease, and Q496P in the NSP2 protein.

In order not to miss mutations leading to a deadly outcome, we also included all mutations which were present in at least 10 patients with severe outcome. This additional analysis delivered two further mutations: V1176F in the surface (S) glycoprotein and L71F in the NSP7 gene. These resulted in 27 and 11 severe outcomes after being spotted in 28 and 11 patients, respectively.

Interestingly, the overall prevalence of mutations leading to mild outcome (n=1,565) was smaller than the prevalence of those leading to worse outcome (n=6,696). Nevertheless, a considerable proportion of the mutations (n=5,053) were not significantly correlated to any clinical outcome. The complete list of all mutations correlated to severe disease is presented in Table 3.

We have simultaneously analyzed the correlation between patient outcome and all identified mutations resulting in amino acid sequence changes in the viral proteins.
Strikingly, we not only found a significant number of mutations, but some of these were correlated to mild disease while others had a significant correlation to severe outcome.

The nucleocapsid phosphoprotein, the protein with the most significant mutations, was linked to both mild and severe patient outcome. All these changes are at close genomic positions, with S197L and S202N resulting in mild outcome and R203K, G204R, and S194L resulting in inferior outcome. When comparing the S197L (76% of mild outcome) to the S197L (less than 1% chance of a mild outcome) variants, the relative risk was extremely high. Interestingly, the majority of the nucleocapsid phosphoprotein mutations mapped to a small stretch of amino acids from position 194 to 204. This region coincides with the phosphorylated "RS-motif" [14], which maps onto the intrinsically unstructured serine-rich region 181-213 of the protein [15]. Phosphorylation of this site is known to play important roles, such as recruitment of the host RNA helicase DDX1, which facilitates template readthrough and enables longer subgenomic mRNA synthesis (https://www.uniprot.org/uniprot/P59595). This observation needs further follow-up, especially because the nucleocapsid phosphoprotein is one of the potential drug targets against SARS-CoV-2 [16].

Overall, we observed more mutations in the structural proteins (spike and nucleocapsid phosphoprotein) than in non-structural proteins. Of note, destabilizing mutations in nonstructural proteins were suggested to represent a potential mechanism differentiating SARS-CoV-2 from SARS [17].

Previously, a set of common deletions was identified in the spike protein of SARS-CoV-2 [18]. Other deletions were also validated by RT-PCR [19]. However, due to the scarce number of identical changes, we could not evaluate a potential link between deletions and patient outcome.
Importantly, our findings might contain a sampling bias, since only a fraction of the available genomes had patient outcome data. On the other hand, five out of 17 potentially significant mutations (listed in Tables 2 and 3) map to an approximately 10 amino acid long, functionally important region of the nucleocapsid phosphoprotein, which leads us to believe that the current statistical approach can reveal functionally important sites within the SARS-CoV-2 genome.

Coronaviruses generally have a stable genome which changes very little over time [20]. A fundamental question of SARS-CoV-2 research is whether the virus can get weaker or stronger with time. Our findings suggest that there are mutations that can support either of these changes, so it is theoretically possible that in the future the viral effect will shift towards milder or more severe patient outcomes.
1. The proximal origin of SARS-CoV-2
2. SARS-CoV-2: fear versus data
3. Comparison of mortality associated with respiratory viral infections between
4. Baseline Characteristics and Outcomes of 1591 Patients Infected With SARS-CoV-2 Admitted to ICUs of the Lombardy Region
5. A new coronavirus associated with human respiratory disease in China
6. Phylogenetic network analysis of SARS-CoV-2 genomes
7. Mutational analysis of the coat protein gene of potato virus X: effects on virion morphology and viral pathogenicity
8. Expression of measles virus V protein is associated with pathogenicity and control of viral RNA synthesis
9. A phylogenetically conserved hairpin-type 3' untranslated region pseudoknot functions in coronavirus RNA replication
10. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
11. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
12. SARS-CoV-2 Spike protein variant D614G increases infectivity and retains sensitivity to antibodies that target the receptor binding domain. bioRxiv
13. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
14. Phosphorylation of the arginine/serine dipeptide-rich motif of the severe acute respiratory syndrome coronavirus nucleocapsid protein modulates its multimerization, translation inhibitory activity and cellular localization
15. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content
16. Virtual screening and dynamics of potential inhibitors targeting RNA binding domain of nucleocapsid phosphoprotein from SARS-CoV-2
17. COVID-2019: The role of the nsp2 and nsp3 in its pathogenesis
18. Identification of common deletions in the spike protein of SARS-CoV-2
19. An 81-Nucleotide Deletion in SARS-CoV-2 ORF7a Identified from Sentinel Surveillance in Arizona
20. Genetic variability of human respiratory coronavirus OC43