key: cord-294295-sd5893ii
authors: Badua, Christian Luke D.C.; Baldo, Karol Ann T.; Medina, Paul Mark B.
title: Genomic and Proteomic Mutation Landscapes of SARS‐CoV‐2
date: 2020-09-24
journal: J Med Virol
DOI: 10.1002/jmv.26548
sha:
doc_id: 294295
cord_uid: sd5893ii

The ongoing pandemic caused by a novel coronavirus, SARS‐CoV‐2, affects thousands of people worldwide every day. Hence, drugs and vaccines effective against all variants of SARS‐CoV‐2 are crucial. Mutations in the viral genome are common and may alter the encoded proteins, possibly resulting in varied effectiveness of detection tools and disease treatment. Thus, this study surveyed the SARS‐CoV‐2 genome and proteome and evaluated their mutation characteristics. Phylogenetic analyses of SARS‐CoV‐2 genes and proteins show three major clades and one minor clade (P6810S; ORF1ab). The overall frequencies and densities of mutations in the genes and proteins of SARS‐CoV‐2 were observed. Nucleocapsid exhibited the highest mutation density among the structural proteins, while the Spike D614G mutation was the most common, occurring mostly in genomes outside China and the USA. ORF8 protein had the highest mutation density across all geographical areas. Moreover, mutation hotspots neighboring and at the catalytic site of RNA‐dependent RNA polymerase were found that might challenge the binding and effectiveness of remdesivir. Mutation coldspots, which may serve as conserved diagnostic and therapeutic targets, were found in ORF7b, ORF9b and ORF14. These findings suggest that the virion's genotype and phenotype in a specific population should be considered in developing diagnostic tools and treatment options.

COVID-19, which presents with pneumonia-like symptoms, surfaced from a seafood market in Wuhan, Hubei Province, China in December 2019, and has since spread across the globe.
1 According to the WHO, it has affected 213 countries and territories, with 23,057,288 people infected and 800,906 deaths worldwide. 2 Mitigation of this public health crisis can be accomplished through effective public health safety protocols, vaccines, and targeted viral treatment. The scientific community has therefore been racing to develop vaccines and therapeutic drugs to combat COVID-19.

classifying this clade were identified in five China-derived samples, while P6810S has not been identified in current literature.

The genomic mutation profile of SARS-CoV-2 was evaluated, and the distribution of the mutations across the viral genomes from different geographical areas is summarized in Figure 3A. In the overall Total dataset, 674 nucleotide mutations were identified in genome samples collected from December 2019 to May 2020 (Table 1). Generally, mutation frequencies among the geographical areas followed a 3:1 transition-to-transversion ratio (Figure 3B, 3C), in which the C>T substitution was most common (44.7%), followed by T>C (13.95%); the G>T transversion (10.83%) was the third most frequent nucleotide change. Interestingly, the ORF3a gene and the 3'UTR had a higher transversion density than transition density, a pattern that was similar between the two timepoints. Altogether, approximately similar proportions of nucleotide change types were observed among the geographical areas for genomes collected from December 2019 to March 2020 versus December 2019 to May 2020 (Figure 3B, 3C). These findings may suggest that the genomic mutation characteristics of SARS-CoV-2 at the earlier timepoint do not differ markedly from those of the later period (e.g. March to May 2020).

Among the SARS-CoV-2 genomic regions, the untranslated regions (UTRs) yielded the highest mutation density, at 7.5 × 10^-3 for the 5'UTR and 2.5 × 10^-2 for the 3'UTR among all geographical areas, for both timepoints (Figure 3D, 3E). Notably, indels were found mostly in the UTRs.
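The transition/transversion split reported above can be illustrated with a minimal sketch. It assumes mutations are written in the notation used in this paper (for example `241C>T`); the function names `classify_substitution` and `ts_tv_counts` are hypothetical and are not part of the authors' pipeline, which used the QIAGEN Genomics Workbench. A transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); every other single-base change is a transversion.

```python
import re
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

# Matches '241C>T' or 'C>T'; indels such as '26delA' do not match.
_SUBSTITUTION = re.compile(r"^\d*([ACGT])>([ACGT])$")


def classify_substitution(change):
    """Return 'transition' or 'transversion', or None for non-substitutions."""
    match = _SUBSTITUTION.match(change.strip().upper())
    if match is None:
        return None
    ref, alt = match.groups()
    if ref == alt:
        return None
    pair = {ref, alt}
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ts_tv_counts(changes):
    """Count transitions and transversions, skipping indels."""
    counts = Counter()
    for change in changes:
        kind = classify_substitution(change)
        if kind is not None:
            counts[kind] += 1
    return counts["transition"], counts["transversion"]


# UTR mutations named in the text; 26delA is an indel and is skipped.
utr_changes = ["241C>T", "29742G>T", "29870C>A", "26delA", "28C>T"]
transitions, transversions = ts_tv_counts(utr_changes)
print(transitions, transversions)
```

Dividing the two counts over a full mutation list would give the kind of transition-to-transversion ratio quoted in the text; the five UTR changes used here are only a small illustrative input.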
As shown in Figure 4B, no UTR mutations were common to all areas; mutations shared between USA and Others were in the 5'UTR (241C>T) and the 3'UTR (29742G>T and 29870C>A), while 26delA and 28C>T in the 5'UTR were shared between China and Others. Overall, the UTRs are consistently densely mutated, suggesting that they are mutation-prone regions of the SARS-CoV-2 genome.

The impact of these genomic mutation characteristics on the viral proteins was then investigated. The results are described by geographic area, with emphasis on the differences between the two time points. Most of the nucleotide mutations in the SARS-CoV-2 genome (62.01%) led to missense mutations in the encoded proteins. The nucleotide mutation hotspots at genome reference positions 11083 (ORF1ab; nsp6), 26144 (ORF3a) and 28144 (ORF8) were common to all geographical areas (Figure 4A).

Most of the amino acid substitutions in China were "Polar<>Neutral" changes (66.67%) at the first time point; this proportion decreased at the second time point (57.14%), when deletion mutations also appeared (1.43%). There was also an increase in substitutions in which the residues were similar in nature ("Similar Change", e.g. Valine to Isoleucine) (18.52% for 12/2019-03/2020; 31.43% for 12/2019-04/2020). These data are shown in Figure 5B. Furthermore, the mutation hotspots based on mutation densities also changed in China, where mutations in the Spike glycoprotein, Protein 3a, Membrane protein, ORF6 protein and ORF10 protein appeared at the second time point (Figure 5C).

In the USA, the proportions of the types of amino acid substitutions did not change drastically (Figure 5). "Polar<>Neutral" mutations were almost the same between the two time points (36.36% for 12/2019-03/2020; 36.57% for 12/2019-04/2020), and "Similar Change" mutations changed minimally (46.75% for 12/2019-03/2020; 47.01% for 12/2019-04/2020). "Similar Change" mutations had the highest frequency among the mutation types in USA samples. The mutation density bar graphs show that amino acid substitutions appeared in the M and ORF7a proteins (Figure 5C).

For the Others geographic area, the proportions of "Polar<>Neutral", "Charged<>Polar", and "Charged<>Neutral" mutations changed considerably (Figure 5B). The proportion of "Polar<>Neutral" mutations was higher at the earlier time point than at the second (31.71% → 22.49%), as shown in Figure 5B. The proportions of "Charged<>Polar" and "Charged<>Neutral" mutations increased between the two time points (4.88% → 7.10% for "Charged<>Polar"; 4.88% → 18.34% for "Charged<>Neutral"). Mutations in the M protein and ORF6 protein appeared in the Others geographic area according to the mutation density graphs (Figure 5C), and "Similar Change" substitutions appeared at the second time point in the ORF8 protein and N protein (Figure 5C).

The total count of amino acid substitutions in the proteins of SARS-CoV-2 was 381. In the data collected at the earlier timepoint, most of the mutations in the proteins were classified as "Similar Change" (44.83%), while insertions were the least frequent (1.15%). With the addition of data from the later timepoint, "Similar Change" mutations remained the most frequent, although at a lower proportion (43.45%), and insertions remained the least frequent at 0.84%. The breakdown of the mutations in the SARS-CoV-2 proteins based on the collected genomes is shown in Figure 5A. Overall, there was an observed shift in the proportions of the different classes of amino acid mutations between the two collection periods and geographic areas.
There was an increase in the proportion of "Similar Change" mutations in China between the two collection periods, while deletion mutations emerged at the later time (Figure 5A). In the USA, by comparison, the proportions of the classes of amino acid mutations were generally unchanged.

Prominent mutations involving both structural and non-structural proteins of SARS-CoV-2 were found and further evaluated in this study from a spatio-temporal perspective.

In samples from China, the D614G substitution did not occur at either time point (Figure 5A); in the USA samples, however, the frequency of the D614G mutation increased (n_D614G = 1 → n_D614G = 8) (Figure 5A). The same pattern was seen in the "Others" geographic area (n_D614G = 4 → n_D614G = 18). The mutation density of the Spike glycoprotein increased in all of the geographic areas (China, USA and Others; Figure 5C). The D614G substitution in the Spike protein (S) occurred five times in the data collected at the earlier time and 26 times in the overall Total data. This mutation co-occurred with the P4715L (ORF1ab) mutation (Figure 2B, Figure 5A). D614G results from a transition mutation in the S gene of SARS-CoV-2 (23403A>G) and is classified as a "Charged<>Neutral" amino acid mutation. The mutation density of S based on the earlier data was 0.01414 mutation events per amino acid of S glycoprotein length, and this value approximately doubled in the overall data. Additionally, 4 other hotspots in the Spike protein were detected in this study (Table 1). These data may suggest that the S variant arose outside of China and is observed more often in the USA and other countries.

Across the geographical areas, no mutations were found in the ORF6, ORF7a/7b, ORF9b, ORF10 and ORF14 proteins at the earlier study timepoint; these were therefore considered coldspots for that period (Figure 3D and 5B).
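The amino acid substitution classes used in these results ("Similar Change", "Polar<>Neutral", "Charged<>Polar", "Charged<>Neutral") can be sketched as a small lookup. The grouping of residues into charged, polar, and neutral (non-polar) sets below is an assumption made for illustration, since the exact assignment used in the study is not given in this text; the function name `classify_aa_substitution` is likewise hypothetical. Under this assumed grouping, the two cases stated in the text are reproduced: D614G falls under "Charged<>Neutral" and a valine-to-isoleucine change under "Similar Change".

```python
import re

# Assumed grouping of the 20 standard residues (not taken from the paper).
_GROUPS = (
    ("Charged", "DEKRH"),
    ("Polar", "STNQYC"),
    ("Neutral", "AVLIMFWPG"),
)
RESIDUE_CLASS = {aa: label for label, residues in _GROUPS for aa in residues}

# Fixed label order so pairs read as in the paper, e.g. 'Polar<>Neutral'.
_RANK = {"Charged": 0, "Polar": 1, "Neutral": 2}

# Matches substitutions written like 'D614G'.
_AA_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def classify_aa_substitution(mutation):
    """Classify a substitution such as 'D614G' by residue class change."""
    match = _AA_SUBSTITUTION.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"not a single amino acid substitution: {mutation!r}")
    ref, _position, alt = match.groups()
    if ref not in RESIDUE_CLASS or alt not in RESIDUE_CLASS:
        raise ValueError(f"unknown residue in {mutation!r}")
    ref_class, alt_class = RESIDUE_CLASS[ref], RESIDUE_CLASS[alt]
    if ref_class == alt_class:
        return "Similar Change"
    first, second = sorted((ref_class, alt_class), key=_RANK.get)
    return f"{first}<>{second}"


for name in ("D614G", "L84S", "P4715L"):
    print(name, classify_aa_substitution(name))
```

Insertions and deletions, which the study tallies as separate classes, are outside the scope of this sketch and raise an error here.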
At the later time point, however, only the ORF7b, ORF9b and ORF14 proteins were identified as mutation coldspots (Figure 3E and 5B). Note that the absence of detected mutations in the ORF9b and ORF14 proteins may be due to limitations in the annotation of viral genome regions, as this study identified genes and proteins from the publicly available annotation of the reference sequence (NCBI GenBank™ Accession ID: NC_045512). All in all, the ORF7b gene/protein had no mutations in any geographical region at either study timepoint; this gene may therefore be conserved in SARS-CoV-2.

Prominently, the ORF8 protein presented the highest mutation density among the non-structural proteins (0.223 mutations/aa site in the overall Total), a pattern that held in all geographical areas and at both timepoints (Figure 3D, 3E, 5B). At the later timepoint, its mutation density had almost doubled. Mutation densities also increased at other notable sites: they roughly doubled in nsp3 (0.072 mutations/aa site by March to 0.136 mutations/aa site by May) and roughly quadrupled in the RNA-dependent RNA polymerase (RDRP) (0.0139 mutations/aa site by March to 0.0515 mutations/aa site by May). The recurrence of ORF8 mutations was attributed to L84S, which was consistently the most frequent mutation in China and the USA (Figure 3A, 5A). In Others, however, the most recurrent mutation varied: it was G251V in protein 3a at the earlier timepoint and P4715L in RDRP by the later timepoint (Figure 3A, 5A). This may suggest that the distinctive abundance of ORF8 mutations is generally similar among the different areas, as its collective frequency increases over time.

For both time points, N had the highest mutation density among the structural proteins (0.02148 for the earlier data; 0.1122 for the overall data). Twelve nucleotide sites were considered hotspots in N, comprising 48% of the mutations in N (Table 1). Mutation densities of the other structural proteins are shown in Figure 5B.
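The mutation densities quoted above are mutation events divided by protein length, which can be sketched as follows. The protein lengths are those of the NC_045512 reference annotation (S 1273 aa, N 419 aa, ORF8 121 aa, ORF7b 43 aa). The event counts are hypothetical: they are not stated in the text and were chosen only because they reproduce the reported densities under this definition. The helper names are also assumptions for illustration.

```python
# Protein lengths in amino acids, per the NC_045512 reference annotation.
PROTEIN_LENGTH_AA = {"S": 1273, "N": 419, "ORF8": 121, "ORF7b": 43}


def mutation_density(events, length_aa):
    """Mutation events per amino acid site."""
    if length_aa <= 0:
        raise ValueError("protein length must be positive")
    return events / length_aa


def coldspots(event_counts):
    """Proteins with zero recorded mutation events, sorted by name."""
    return sorted(name for name, n in event_counts.items() if n == 0)


def fold_change(earlier, later):
    """Ratio of a later density to an earlier one."""
    return later / earlier


# Hypothetical counts, back-computed from the densities reported in the text.
earlier_counts = {"S": 18, "N": 9, "ORF7b": 0}
overall_counts = {"N": 47, "ORF8": 27, "ORF7b": 0}

for name, n in earlier_counts.items():
    print(name, round(mutation_density(n, PROTEIN_LENGTH_AA[name]), 5))

print(coldspots(overall_counts))

# Fold changes computed directly from the densities quoted in the text.
print(round(fold_change(0.072, 0.136), 1))    # nsp3, March to May
print(round(fold_change(0.0139, 0.0515), 1))  # RDRP, March to May
```

Normalizing by length is what lets a short protein such as ORF8 be compared with a long region such as ORF1ab, and a density of zero is what identifies a coldspot such as ORF7b.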
Interestingly, 10 SARS-CoV-2 samples had a substitution at nucleotide positions 28881-28883 (GGG>AAC). This nucleotide mutation led to two amino acid substitutions (R203K and G204R). The earliest recorded SARS-CoV-2 genome with this mutation was from Florida, USA (February 28, 2020; Accession ID: MT276330), while the other 9 genomes with this mutation came from the "Others" geographic area (Israel, Peru, Brazil, Greece, Czech Republic, and Argentina). However, the order of mutation densities of the structural proteins varied among geographic areas, with N having the 3rd highest mutation density in the Others geographic area for the overall data (Figure 5C). These findings suggest that the mutation in the N protein did not initially occur in China but occurred first in the USA.

Nsp16 is responsible for the mRNA capping of the coronavirus genome, primarily to protect it from host recognition. 16 According to the crystal structure of nsp16, the domain containing P6810 is unknown; however, the residue is characterized as part of a bend in nsp16. Proline exhibits conformational rigidity that is expected to result in a kink; its substitution may change the steric conformation of this bend. 16 Additionally, one of the amino acids immediately surrounding the nsp16-nsp10 complex that is proximal to P6810 is a tryptophan at aa position 7029 of ORF1ab. 16 Substitution with serine (P6810S) might enhance hydrogen bonding with this tryptophan. 17,18 This mutation needs further investigation to determine its significance in host evasion. It is also important to evaluate its prevalence in the Chinese population and in the global population to fully understand its implications for the function of nsp16.

The increased recurrence of the L84S mutation may suggest that this variant is favorable for the virus's survival across geographical regions.
6, 19 The subclades of L84S have mutations that may affect viral replication, immune evasion, viral release, and virion assembly. 1,20-23 Further research may ascertain how this mutation changes the function of ORF8 in virus replication, as well as potential changes in immune evasion and viral release.

Observations in this study are consistent with the general pattern in which transitions are more prevalent than transversions, perhaps due to steric considerations. 24-25 Interestingly, mutations in ORF3a (modulating host immune response) and the 3'UTR (RNA stability and translation) consist largely of transversions, suggesting that these regions may be more error-prone than other regions and more prone to random transversion substitutions. 24 This might suggest that there are changes in virulence and replication stability across global regions.

Findings based on the mutation landscape of SARS-CoV-2 may differ from those in previously published literature. A study by Pachetti and colleagues (2020) described a mutation in RDRP (nt14408) whose count increased from 7 (February) to 10 (cumulative by March). 25 This is consistent with this study's findings, which show greater recurrence: 4 occurrences (March) to 26 (cumulative by May). Additionally, Kim et al. (2020) also described a low frequency of mutations in E, M, and ORF7a, similar to this study's result. 26 Other studies such as these described a high frequency of mutations in ORF1ab, which may be attributed to the relatively long length of the region. To address this, this study normalized for gene length and presented the data as mutation densities for each gene in SARS-CoV-2. Discrepancies in mutation frequencies between this study and that of Tiwari and Mishra may be attributed to the following reasons: (1) In this study, a mutation observed only once is already considered a valid mutation.
In contrast, in Tiwari and Mishra's study, a mutation had to occur at least three times before it was considered legitimate. (2) The samples considered in this study were collected later in the pandemic, providing more time and opportunity for the virus to accumulate mutations. In contrast, the samples in Tiwari and Mishra's study were collected earlier in the pandemic, leaving less time for the virus to accumulate mutations. 27

Remdesivir, which is known to inhibit RDRP, is currently in phase 3 of COVID-19 clinical trials. 28 The active component of remdesivir (GS-441524; adenosine nucleotide analog) binds to the RDRP catalytic site and halts nucleic acid elongation. 29 The missense mutation (D722Y) that occurred at the catalytic site, along with neighboring variants (V472D, L469S), involves a change from an acidic to a non-polar residue and may potentially result in increased hydrophobicity in the region, leading to a more elusive conformation. This potential impact may significantly influence the RDRP conformation, which might challenge the effectiveness of remdesivir. 30 Hence, SARS-CoV-2 RDRP mutations, especially considering regional variability, should be further investigated for their potential effect on RDRP structure and function to support the use of remdesivir.

The absence of the D614G mutation in China, while it was abundant in the Others geographic area, suggests potentially variable effectiveness of vaccines and neutralization factors that target the RBD among different geographic areas. Alternatively, relatively conserved regions in the spike heptad repeat 1 and heptad repeat 2 may serve as potential drug or vaccine targets for inhibiting viral entry. As shown in this study, mutations in the spike protein could alter domains that may be involved in epitope recognition (i.e. RBD, S1 N-terminal domain) by neutralizing antibodies (nAbs). 31-32 Hence, binding of a potential nAb to putative SARS-CoV-2 epitopes may be hindered.
Further studies should be done to evaluate the putative effectiveness of neutralizing monoclonal antibodies against SARS-CoV-2.

The changes in the mutation frequencies and densities in N imply that the evolution of the N gene and protein over time in different landmasses is beneficial for the adaptation of SARS-CoV-2 as it spreads globally. 32 Currently, the WHO and the CDC recommend the use of the N1 and N2 genes in COVID-19 surveillance. 33 Recent publications have criticized the use of these genes in COVID-19 diagnosis by reverse transcriptase-polymerase chain reaction (RT-PCR) because of their relatively high mutation index. 34-35 There are variants that fall in the forward primer for N3 and in the reverse primer of N1 (nt 28688). 36 This was a hotspot mutation in the genome and proteome of SARS-CoV-2, as observed in this study. These observations support the view that variations in N may pose difficulties in diagnosis using N-targeted primers for qRT-PCR.

The SARS-CoV-2 genomes used in this study are assumed to have come from individuals undergoing COVID-19 testing, before any of them received antiviral treatment. Since SARS-CoV-2 genomes from individuals who have received antiviral treatment are not currently available, the mutation patterns of these two groups cannot yet be compared, but speculations can be made. Mutations in the virus can exist and persist in the absence of selective pressure; the diversity of mutations is therefore high, and no variants exist at unusually high frequencies. This is likely the phenomenon we have observed, with a few exceptions such as L84S (ORF8), D614G (S), and L3606F (ORF1ab). However, antiviral drugs can serve as selective pressure against certain types of mutations in the virus, possibly reducing the overall diversity of the virus while increasing the frequencies of a select few variants that are resistant to the antiviral drug.
These variants may become more dominant in the population, and this may affect the overall patterns and frequencies of mutations in SARS-CoV-2.

In conclusion, this study highlights the importance of characterizing both the nucleotide and amino acid mutation landscapes of SARS-CoV-2 to identify hotspots and coldspots that may be significant for the effectiveness of diagnostic tools and treatment options for COVID-19 across different areas worldwide as the pandemic continues.

The authors would like to acknowledge and give their sincere gratitude to the scientific community involved in providing, curating, and disseminating data for SARS-CoV-2 genomes, particularly GISAID and NCBI. The authors would also like to acknowledge QIAGEN for generously providing free access for a limited time to their Genomics Workbench Application. Additionally, the University of the Philippines College of Medicine MD-PhD Batch 10, Baldo, J. Asi, R., Batulan, R., are to be acknowledged for their aid in checking the readability and comprehensibility of the paper. Finally, the authors would like to acknowledge all the health-care workers and frontline-support workers involved in the response to the COVID-19 pandemic, in our institution and in our country.

The authors of this study declare that this work has received no financial support.

The authors declare that there are no conflicts of interest regarding this study.

CB, KB, and PM designed this study. CB and KB contributed equally to data collection, data analysis, technical graphics and processing, and writing the paper. PM contributed to the critical evaluation of the figures and results and the critical review of the manuscript. All authors contributed to revising the manuscript and approved the final version submitted.

The data that support the findings of this study are publicly available in NCBI GenBank at https://www.ncbi.nlm.nih.gov/nucleotide/ and in GISAID EpiCoV at https://www.gisaid.org/.
Samples under the geographic cluster "China" are colored red, sequences under the geographic cluster "USA" are colored blue, and those under the "Others" geographic cluster are colored black.

Characterization of Nucleotide Mutations in SARS-CoV-2. SARS-CoV-2 genomes were identified independently, and mutations were considered to occur spontaneously. Mutations were identified as substitutions relative to the SARS-CoV-2 reference genome (NCBI GenBank™ Accession ID: NC_045512). (A) Nucleotide mutation frequency plot in Total (overall), and in geographical clusters: China, USA and Others. (B) Proportion of the nucleotide mutation types in SARS-CoV-2 genomes submitted on