key: cord-0652855-jpu1ld8y authors: Svoboda, William; McManamon, Brendan; Schwartz, Sara title: Replication of SARS-CoV-2 mutation analysis suggests differences in per-protein mutation characteristics date: 2021-12-14 journal: nan DOI: nan sha: b69fe4c088a1e393911b562cc9e9f9c20a9f7732 doc_id: 652855 cord_uid: jpu1ld8y The increasing spread of COVID-19, caused by the virus SARS-CoV-2, raises concerns about the extent to which mutations have occurred across the viral genome. We present a partial replication of an earlier 2021 study by Wang, R. et al. that determined the presence of four substrains and eleven top mutations in the United States. We analyze a portion of the authors' data set in order to recreate Figure S1 from the paper, recapitulating the same features observed in the original figure. We further generate a summary of mutation characteristics for each of the 26 named proteins and confirm the significance of the spike protein at roughly 24% of all recorded mutations. Our analysis suggests that additional factors may affect per-protein mutation rate besides protein length. The spread of coronavirus disease 2019 (COVID-19) continues to be of scientific and public health interest in the context of the ongoing global pandemic. Of particular concern is whether the underlying virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has become more infectious as a result of new mutations. A 2021 paper by Wang, R. et al. 1 analyzed 45,494 complete SARS-CoV-2 genome sequences in order to analyze their mutations. Of the 12,754 sequences from the United States, the authors determined the presence of four substrains and eleven top mutations. Furthermore, the study concluded that two of these substrains had become more infectious given the mutations found on the spike protein. We present a partial replication of the original study in order to validate its findings. A subset of the data used in the paper was processed in order to recreate selected figures (Figure 1 ). Further analysis was also performed in order to summarize mutation characteristics for each protein across the entire SARS-CoV-2 genome. Finally, we discuss the implications of these findings on our current understanding of the SARS-CoV-2 viral makeup. arXiv:2112.07770v2 [q-bio.GN] 24 Dec 2021 As part of the original study 1 , a website called Mutation Tracker was created to report the distribution of unique single mutations across the entire SARS-CoV-2 genome. A screen shot of this tracker is included in the paper's 1 supplementary information ( Figure S1 , included as Figure 1 in this report). We downloaded the summary data provided from the website, itself generated by Wang, R. et al. using sequence data from GISAID (https://www.gisaid.org/). The summary data is updated automatically by the mutation website, with the totals from 31 October 2021 being used in this paper. Split among 26 different files for each named protein, the summary data was reprocessed into a single data frame representing the whole viral genome. As Figure 2 shows, we replicated the original distribution graph using this updated data. It is evident that NSP3 and the spike protein are both among the largest proteins and have experienced a high frequency of unique mutations across their domains. We also observe that ORF7a, ORF7b, and ORF8 all have relatively high mutation frequencies, in spite of their short length. To quantify our interpretations, we computed further statistics to summarize the mutation distribution. From our summary statistics, we generated a graphical breakdown of the total mutation percentage. Figure 3 displays this data for the top 7 proteins between 20 October 2021 and 31 October 2021. Following from our analysis of the summary statistics, the spike protein made up roughly 24% of the total recorded mutation frequency. The graph changes little between the 10 days comprising each time point. Of note is that a full 13 proteins make up less than 1% each of the total. Our analysis successfully replicated the original plot of the mutation distribution represented by Figure 1 . Our version, Figure 2 , makes use of the updated data provided from the Mutation Tracker website and recapitulates the features observed there. Along with the summary statistics from Table 1 , we corroborated the significance of the spike protein in the current pattern of SARS-CoV-2 mutation. Our approach is limited by the scope of the data used in the study. The Mutation Tracker website only allows downloading summary data from the most recent time point. While we had preserved an earlier archive from 20 October 2021, a more thorough comparison than that seen in Figure 3 would use a greater range of dates. Further study into the change in mutation over time, either by sampling the tracker or by directly processing the source data from GISAID, could provide more insight. The summary statistics from Table 1 suggest differences in mutation characteristics between the 26 studied proteins. Given our observation of relatively more mutations in shorter proteins, a conformational analysis could further assess the influence of a protein's size and 3D structure on the number of mutations it experiences. There is also interest in determining the extent, if any, of evolutionary pressure on mutation rate. Some of the proteins comprising the lowest percentages of the total mutation frequency, such as NSP11 and the envelope protein, are known to have important roles in the viral life cycle 2,3 . Future work could determine if proteins with functions in viral replication versus internal regulation experience more mutations. In this paper, we presented a partial replication of an earlier study by Wang, R. et al. 1 . We recapitulated their visualization of the genome-wide SARS-CoV-2 mutation distribution. We further analyzed the underlying data in order to summarize the influence of each protein on the overall mutation frequencies. The current global pandemic remains one of the most significant challenges to public health in recent memory. In light of emerging variants of SARS-CoV-2, it is even more important to understand how and where the virus mutates. The original paper 1 used 45,494 complete genome sequences of SARS-CoV-2 strains from infected invdividuals. These sequences were sourced from the GISAID (https://www.gisaid.org/) database up to 11 September 2020. For this analysis, we used the 31 October 2021 summary data from the Mutation Tracker website. The tracker is continually updated with the latest data from GISAID. Data preprocessing was accomplished using R 4 and the dplyr 5 package. The 26 .csv files for each named protein were parsed and combined into a single data frame in order of position on the SARS-CoV-2 genome. Mutation frequencies were mapped to the corresponding protein name using nucleotide positions sourced from the NIH (https://www.ncbi.nlm.nih.gov/gene/). The plot of the genome-wide SARS-CoV-2 mutation distribution ( Figure 2 ) was generated using the Plotly package for R 6 . The stacked bar chart comparing per-protein mutation percentages (Figure 3 ) was generated using the ggplot2 package 7 using the same data as the mutation distribution plot (summarized for each protein). This paper uses the summary data from the Mutation Tracker website up to 31 October 2021. This data was originally sourced from GISAID (https://www.gisaid.org/). All code for this paper is attached in the appendix and is also available on the author's (W.S.) GitHub at https://github.com/thisstillwill/QCB455-Fall-2021. conducted multiple sequence alignment for each protein and calculated percent similarity in order to replicate Figures S5, S7 , S8, S10, and S13. Additionally, B.M. added a table summarizing substitution mutation frequencies. S.S. conducted multiple sequence alignment and single-nucleotide polymorphism (SNP) calling on the complete genome sequences to create SNP profiles for each genome, and used a Jaccard distance-based representation of the SNP variants as input features into a K-means clustering algorithm. S.S. plotted the distributions of the resulting clusters on a U.S. map to replicate Figure 1 from the original paper. All authors have read and agreed to the published version of the manuscript. Sievert, C. Interactive web-based data visualization with r, plotly, and shiny. (Chapman; Hall/CRC, 2020). scatter_palette <-c("#a0c8ff", "yellow", "#0d8f81", "#774ad6", "#eb4a40", "#3e6db2", "#a0c8ff", "#fdae6b", "#774ad6", "#a0c8ff", "black", "#0d8f81", "#774ad6", "#eb4a40", "#3e6db2", "#a0c8ff", "black", "#04c8ff", "#a0c8ff", "yellow", "#0d8f81", "#774ad6", "#eb4a40", "#3e6db2", "#a0c8ff", "#fdae6b") # Create scatter plot of mutations scatter_fig <-plot_ly(data = mutation_summaries_combined, x =~position, y = log(frequency), → type = "scatter", mode = "markers", color =~protein, colors = scatter_palette) scatter_fig <-scatter_fig %>% layout(title = "Mutation distribution across the SARS-CoV-2 genome", xaxis = list(title = "", → showgrid = FALSE, tickangle = 60, tickmode = "array", nticks = 11, tickvals = c(536, → 5638, 9305, 10514, 11408, 14839, 17138, 23474, 25807, 26359, 28904), ticktext = c("NSP1", "NSP3", "NSP4", "3CL", "NSP6", "RdRp", "Helicase", "S", "ORF3a", "E", "N")), yaxis = list(title = "ln(Frequency)", showgrid = FALSE), font = list(family = "Serif", size = 32, color = "black"), showlegend = FALSE, margin = list(t = 200)) save_image(scatter_fig, "./figures/updated_distribution.pdf", width = 2304, height = 1152) → knitr::include_graphics("./figures/updated_distribution.pdf") # Generate mutation statistics by protein aggregate_summaries <-mutation_summaries_combined %>% group_by(protein) %>% col.names = c("protein", "total mutations", "% total mutations", "protein length", "mutations/nucleotide"), caption = "Summary statistics for each SARS-CoV-2 protein (author W.S.)", → format.args = list(big.mark = ",")) # Read mutation summary data for 10202021 path <-"./data/MutationSummary_10202021/" files <-dir(path) mutation_summaries_10202021 <-do.call(rbind, lapply(paste0(path, files), read.csv)) # Find percent mutations by protein for 10202021 bar_data_10202021 <-mutation_summaries_10202021 %>% mutate(position = as.integer(gsub("[ˆ0-9]", "", mutation_site))) %>% group_by(position) %>% summarise(frequency = sum(Total_frequency)) bar_data_10202021$protein <-name_map[bar_data_10202021$position] bar_data_10202021 <-subset(bar_data_10202021, protein != "Others") bar_data_10202021 <-bar_data_10202021 %>% group_by(protein) %>% summarise(total_mutations = sum(frequency)) %>% mutate(raw_percent = (total_mutations/sum(total_mutations)) * 100) %>% arrange(desc(total_mutations)) %>% select(protein, raw_percent) %>% head(7) others <-list("Others", 100 -sum(bar_data_10202021$raw_percent)) bar_data_10202021 <-bar_data_10202021 %>% rbind(others) %>% arrange(raw_percent) %>% mutate(date = "10202021") # Find percent mutations by protein for latest data (10312021) bar_data_10312021 <-aggregate_summaries %>% select(protein, raw_percent) %>% head(7) others <-list("Others", 100 -sum(bar_data_10312021$raw_percent)) bar_data_10312021 <-bar_data_10312021 %>% rbind(others) %>% arrange(raw_percent) %>% mutate(date = "10312021") # Combine data into single data frame for plotting bar_data <-rbind(bar_data_10202021, bar_data_10312021) # Created stacked bar chart for each date ordered by percent stacked_bar <-ggplot(bar_data, aes(x = date, y = raw_percent, group = raw_percent)) + geom_col(aes(fill = protein), width = 0.9) + scale_fill_brewer(type = "qual", palette = "Paired") + geom_text(aes(label = paste0(protein, " -", round(raw_percent, digits = 1), "%")), position = position_stack(vjust = 0.5)) + ggtitle("Percentage of total mutations by protein") + → xlab("Date") + ylab("Percent") + labs(fill = "Protein") + theme_classic(base_family = "serif") + → theme(plot.title = element_text(hjust = 0.5)) ggsave("./figures/stacked_bar.pdf") knitr::include_graphics("./figures/stacked_bar.pdf") # Cite packages knitr::write_bib(c(.packages(), "bookdown"), "packages.bib") sessionInfo(package = NULL) Analysis of SARS-CoV-2 mutations in the united states suggests presence of four substrains and novel variants Structural biology of the arterivirus nsp11 endoribonucleases Coronavirus envelope protein: Current knowledge R: A language and environment for statistical computing. (R Foundation for Statistical Computing Dplyr: A grammar of data manipulation Plotly: Create interactive web graphics via plotly.js ggplot2: Create elegant data visualisations using the grammar of graphics Bookdown: Authoring books and technical documents with r markdown formatR: Format r code automatically Knitr: A general-purpose package for dynamic report generation in r Reticulate: Interface to python Rticles: Article formats for r markdown Bookdown: Authoring books and technical documents with R markdown Elegant graphics for data analysis Dynamic documents with R and knitr A comprehensive tool for reproducible research in R. in Implementing reproducible computational research The authors are grateful to Rui Wang and Guo-Wei Wei for their correspondance providing the color values used in the Mutation Tracker and describing the approach used to clean the summary data. W.S. downloaded and preprocessed the summary data from the Mutation Tracker website. W.S. plotted the mutation frequencies from the summary data to replicate Figure S1 from the original paper. W.S. conducted further analysis of the dataset to summarize the mutation data for each protein. The authors declare no competing interests.summarise(total_mutations = sum(frequency)) %>% mutate(raw_percent = (total_mutations/sum(total_mutations)) * 100, nucleotide_length = sapply (