key: cord-0705238-pfhfmck1
authors: Mercatelli, Daniele; Triboli, Luca; Fornasari, Eleonora; Ray, Forest; Giorgi, Federico M.
title: coronapp: A Web Application to Annotate and Monitor SARS-CoV-2 Mutations
date: 2020-09-10
journal: bioRxiv
DOI: 10.1101/2020.05.31.124966
sha: aaa937796ade0e943a249adaa61d9e56c54f5313
doc_id: 705238
cord_uid: pfhfmck1

The avalanche of genomic data generated from the SARS-CoV-2 virus requires the development of tools to detect and monitor its mutations across the world. Here, we present a webtool, coronapp, dedicated to easily processing user-provided SARS-CoV-2 genomic sequences and visualizing the current worldwide status of SARS-CoV-2 mutations. The webtool allows users to highlight mutations and categorize them by frequency, country, genomic location and effect on protein sequences, and to monitor their presence in the population over time. The tool is available at http://giorgilab.unibo.it/coronapp/ for the worldwide dataset and at http://giorgilab.unibo.it/coronannotator/ for the annotation of user-provided sequences. The full code is freely shared at https://github.com/federicogiorgi/giorgilab/tree/master/coronapp

Data Availability Statement: The data that support the findings of this study derive from the GISAID consortium and are openly available on GitHub, in Rdata format for the R environment, in files results.rda and metadata.rda, at the following link: https://github.com/federicogiorgi/giorgilab/tree/master/coronapp/data

SARS-CoV-2 is a novel pathogenic enveloped RNA beta-coronavirus causing a severe illness in human hosts known as coronavirus disease-2019 (COVID-19). The predominant COVID-19 illness is a viral pneumonia, often requiring hospitalization and in some cases intensive care [1]. With almost 27.5 million laboratory-confirmed positive cases worldwide as of 9 September 2020 and an estimated case fatality rate across 204 countries of 5.2%, COVID-19 has become a global health challenge in only a few months [2]. SARS-CoV-2 infection depends on the recognition of host angiotensin-converting enzyme 2 (ACE2), exposed on the cell surface in human lung tissues [3,4]. The SARS-CoV-2 spike glycoprotein binds ACE2, mediating membrane fusion and cell entry [5]. Upon cell entry, the virus subverts host cell molecular processes, inducing interferon responses and eventually apoptosis [6].

To date, much effort has been made to develop therapeutic strategies to limit SARS-CoV-2 transmission and replication, but no treatment or vaccine has proven effective against the virus, and repurposing of approved therapeutic agents has been the main practical approach to manage the emergency so far [7]. As viruses mutate during replication, the emergence of SARS-CoV-2 sub-strains and the challenge of a probable antigenic drift require attention, especially for vaccine development [8].

Although sequence analyses of SARS-CoV-2 have shown that genomic variability is very low [9], new SARS-CoV-2 mutation hotspots are emerging due to the high number of infected individuals across countries and to viral replication rates [10].

Three major SARS-CoV-2 clades, known as clades G, V, and S, have emerged, showing different geographical prevalence [10]. The most frequent mutation detected so far defines the G clade and causes an amino acid change from aspartate (D) to glycine (G) at position 614 (D614G) of the viral Spike protein [11].

Continual genomic surveillance should be considered to monitor the possible appearance of viral subtypes characterized by altered tropism or causing more aggressive symptoms. Constant and widespread monitoring of mutations is also a powerful means of informing drug development and global or local pandemic management. The Global Initiative on Sharing All Influenza Data (GISAID) has collected to date (9 September 2020) over 90,000 publicly accessible SARS-CoV-2 sequences. The GISAID effort has made it possible to compare genomes on a geographical and temporal scale, and an increasing number of laboratories have started to sequence COVID-19 patient samples worldwide [12, 13]. Several online tools have been developed to monitor the evolution of the virus from a phylogenetic perspective, such as Nextstrain [14], or to visualize epidemiological data such as number of cases and deaths [15]. However, no online tool currently exists to annotate user-provided SARS-CoV-2 genomic sequences, which may derive from specific GISAID subsets or from sequencing efforts of individual laboratories. Nor does any tool specifically monitor the prevalence of specific SARS-CoV-2 mutations associated with particular geographic regions or protein locations, or their frequency in the population over time.

A worldwide analysis is shown, generated using data from GISAID. Specifically, we processed all SARS-CoV-2 complete (>29,000 sequenced nucleotides) genomic sequences, excluding low-quality sequences (>5% undefined nucleotide "N") and viruses extracted from non-human hosts.
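The inclusion criteria above can be expressed as a simple per-sequence filter. The following Python sketch is only an illustration and is not taken from the coronapp code: the function name, the "Human" host label and the choice to measure completeness as total sequence length (including "N" characters) are assumptions.

```python
def passes_quality_filter(sequence, host, min_length=29000, max_n_fraction=0.05):
    """Return True if a genome meets the three inclusion criteria.

    Assumptions (not from the original code): `host` is a free-text label
    where human samples are written "Human", and completeness is measured
    on the total sequence length, N characters included.
    """
    seq = sequence.upper()
    # Criterion 1: discard viruses extracted from non-human hosts.
    if host.strip().lower() != "human":
        return False
    # Criterion 2: keep only complete genomes (>29,000 nucleotides).
    if len(seq) <= min_length:
        return False
    # Criterion 3: discard low-quality genomes (>5% undefined "N").
    n_fraction = seq.count("N") / len(seq)
    return n_fraction <= max_n_fraction


# Toy usage: a 30,000-nucleotide genome with no undefined positions.
toy_genome = "ACGT" * 7500
print(passes_quality_filter(toy_genome, "Human"))
```

The thresholds are exposed as parameters so that stricter or looser cut-offs can be tested without changing the function body.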

The underlying database is updated weekly, and we provide the date of the last version as a reference for studies based on the data provided. We indicate the number of samples processed and the total number of mutational events detected (Figure 1A).
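These summary counts can be derived from a flat list of mutational events. The Python sketch below is illustrative only: the record layout (sample ID, reference position) and the function name are assumptions, not the structure of the results.rda file. It also counts distinct mutated positions and expresses them as a share of the 29,903-nucleotide reference genome (NC_045512.2).

```python
REFERENCE_LENGTH = 29903  # length of NC_045512.2, as stated in the text


def summarize_mutations(records, reference_length=REFERENCE_LENGTH):
    """Summarize a list of (sample_id, refpos) tuples, one per mutational event.

    Note: samples carrying no mutation at all do not appear in `records`
    and are therefore not counted here (a limitation of this toy layout).
    """
    records = list(records)
    samples = {sample_id for sample_id, _ in records}
    loci = {refpos for _, refpos in records}
    return {
        "samples": len(samples),
        "events": len(records),
        "distinct_loci": len(loci),
        "fraction_of_genome": len(loci) / reference_length,
    }


# Toy usage with made-up sample identifiers.
toy_records = [("s1", 241), ("s1", 23403), ("s2", 23403), ("s3", 14408)]
print(summarize_mutations(toy_records))
```

With roughly 20,000 distinct loci, the last field would be about 20,000 / 29,903 ≈ 0.67, i.e. about two thirds of the reference genome.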

We also show the number of distinct mutated loci. Currently, this number is slightly below 20,000, meaning that two thirds of the original Wuhan SARS-CoV-2 genome has been affected by mutations and/or sequencing errors (the full length of the reference genome is 29,903 nucleotides, based on sequence ID NC_045512.2).

The table shows every mutation in a specific geographical area, reporting:

• The GISAID sample ID (useful for cross-referencing with the GISAID database and other analyses based on it, e.g. Nextstrain).

• The country where the sample was collected.

• The position of the mutation, on the reference genome (refpos) and on the sample (qpos).

• The sequence at the mutation site, on the reference genome (refvar) and on the sample (qvar).

• The length of the sample genome (qlength); the reference genome is 29,903 nucleotides long.

o SNP_silent: a change of one or more nucleotides with no effect on the protein sequence.

• The overall mutations per sample, indicating the distribution of mutations per sample. It has been previously reported [10] that the current mode for mutation number compared to the reference NC_045512.2 genome is 7.5.

• The most frequent events per class. Classes are the same as reported in the mutation table and are described in the previous paragraph.

• The most frequent events per type. Individual mutation types are shown as specific nucleotide events, e.g. cytosine to thymine transitions (C>T), guanine to thymine transversions (G>T) or even multinucleotide mutations (e.g. GGG>AAC, observed in the Nucleocapsid protein). As reported before,

and an example input FASTA containing 12 sequences is provided (Figure 3B). The analysis is almost instantaneous and shows an overall breakdown of the most mutated

The webtool coronapp has been developed using the programming language R and is based on a Shiny server (current version 1.4.0.2) running on R version 3.6.1. The app is based on two distinct files, server.R and ui.R, managing the server functionalities and the browser visualization processes, respectively. The results visualization utilizes both basic R functions and Shiny functionalities; for tooltip functionality, coronapp uses the R package googleVis v0.6.4, which provides an interface between R and the Google visualization API [19].

The core of the annotation of user-provided sequences is the NUCMER (Nucleotide MUMmer) alignment tool, version 3.1 [20]. NUCMER output is processed by UNIX and R scripts provided on GitHub within the server.R file.

types. This analysis is also available for both worldwide precomputed and user-input analyses.
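To make the link between an alignment and the mutation table concrete, the following Python sketch lists single-nucleotide substitutions between two sequences that have already been aligned, and then tallies them by event type (e.g. C>T). It is a deliberately simplified stand-in and not a description of how NUCMER or server.R work: the function names and the gapped-string input are assumptions, and only the field names (refpos, qpos, refvar, qvar) are borrowed from the table described earlier.

```python
from collections import Counter


def list_substitutions(ref_aligned, query_aligned):
    """List substitutions between two equal-length, gapped aligned strings.

    Gaps ("-") advance only the coordinate of the sequence that is not
    gapped; gap columns and undefined query bases ("N") are skipped.
    Other IUPAC ambiguity codes are not treated specially in this sketch.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    records = []
    refpos = qpos = 0  # 1-based coordinates, incremented before use
    for r, q in zip(ref_aligned.upper(), query_aligned.upper()):
        if r != "-":
            refpos += 1
        if q != "-":
            qpos += 1
        if r == "-" or q == "-" or q == "N":
            continue
        if r != q:
            records.append(
                {"refpos": refpos, "qpos": qpos, "refvar": r, "qvar": q}
            )
    return records


def count_event_types(records):
    """Tally substitutions by type, written as reference>query."""
    return Counter(f"{rec['refvar']}>{rec['qvar']}" for rec in records)


# Toy usage: one inserted base in the query shifts qpos relative to refpos.
toy_records = list_substitutions("AC-GTC", "ACAGTT")
print(toy_records)
print(count_event_types(toy_records).most_common())
```

Because each column is compared independently, a multinucleotide event such as GGG>AAC would appear here as three separate single-nucleotide records; grouping adjacent substitutions would need an extra step.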

Clinical Characteristics of Coronavirus Disease 2019 in China

Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations

Angiotensin-converting enzyme 2 (ACE2) as a SARS-CoV-2 receptor: molecular mechanisms and potential therapeutic target

Master Regulator Analysis of the SARS-CoV-2/Human Interactome

Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19

A Review of SARS-CoV-2 and the Ongoing Clinical Trials

Emergence of Drift Variants That May Affect COVID-19 Vaccine Development and Antibody Treatment

Genomic variance of the 2019 nCoV coronavirus

Geographic and Genomic Distribution of SARS-CoV-2 Mutations

Comparative genomics suggests limited variability and similar evolutionary patterns between major clades of SARS-CoV-2. bioRxiv

Spread of SARS-CoV-2 in the Icelandic Population

Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States

Nextstrain: real-time tracking of pathogen evolution

Statistical characteristics of amino acid covariance as possible descriptors of viral genomic complexity

Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor

SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate

Using the Google visualisation API with R

Using MUMmer to Identify Similar Regions in Large Sequence Sets