key: cord-0715670-nq0i7ztx
authors: Rueca, Martina; Giombini, Emanuela; Messina, Francesco; Bartolini, Barbara; Di Caro, Antonino; Capobianchi, Maria R.; Gruber, Cesare E. M.
title: ESCA pipeline: Easy-to-use SARS-CoV-2 genome Assembler
date: 2021-05-21
journal: bioRxiv
DOI: 10.1101/2021.05.21.445156
sha: e622a25503f1fe283ec07a11cc23e21094b56d4c
doc_id: 715670
cord_uid: nq0i7ztx

Early sequencing and rapid analysis of the SARS-CoV-2 genome are contributing to the understanding of the dynamics of the COVID-19 epidemic and to the design of countermeasures at the global level. Amplicon-based NGS methods are widely used to sequence the SARS-CoV-2 genome and to identify novel variants that are emerging in rapid succession, harboring multiple deletions and amino acid-changing mutations. To facilitate the analysis of NGS data obtained from amplicon-based sequencing methods, we propose an easy-to-use SARS-CoV-2 genome assembler: the ESCA pipeline. Results showed that ESCA can perform high-quality genome assembly from IonTorrent and Illumina raw data and helps the user easily correct low-coverage regions. Moreover, ESCA can compare the assembled genomes of multi-sample runs through a simple table format. Scripts and manuals are available on GitHub: https://github.com/cesaregruber/ESCA

Whole-genome NGS has reached a pivotal role in the field of emerging infectious diseases, e.g. by enhancing the capacity to develop new diagnostic methods, vaccines and drugs [1, 2], and the production and sharing of sequence data have been recognized as key to outbreak response and management [3] [4] [5]. In the current COVID-19 epidemic, more than one million full genome sequences of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have been deposited in publicly accessible databases (e.g. GISAID) within one year [6, 7]. Global-scale surveillance of SARS-CoV-2 genomes is permitting real-time analysis of the outbreak, with a direct impact on the public health response. This contribution includes tracing the spread of SARS-CoV-2 over time and space and highlighting emerging variants that may influence pathogenicity, transmission capacity, diagnostic methods, therapeutics, or vaccines [8] [9] [10] [11]. Recently, divergent SARS-CoV-2 variants have been emerging in rapid succession, harboring multiple deletions and amino acid mutations. Some mutations occur in the receptor-binding domain of the spike protein and are associated with increased ACE2 affinity as well as with impaired recognition by polyclonal human plasma antibodies [12, 13].

The growing contribution of sequence information to public health is driving global investments in sequencing facilities and scientific programs [14, 15]. The falling cost of generating genomic NGS data provides new opportunities to expand sequencing capacity; however, many laboratories have low sequencing capacity and lack expertise in data analysis.

While sequencing runs can be performed without consolidated experience in the field of infectious diseases, the assembly of virus genomic sequences is often a demanding task.

Translating SARS-CoV-2 raw read data into reliable and informative results is complex and requires solid bioinformatics knowledge, particularly for low-coverage regions, which can lead to incorrect variant calling and erroneous assembled sequences.

Supervision of sequence assembly, to avoid inconsistent or misleading assignment of virus lineage and clade [9, 10], and evaluation of low-coverage samples, to prevent loss of epidemiological information, are mandatory.

In this respect, we propose the Easy-to-use SARS-CoV-2 Assembler pipeline (ESCA): a novel reference-based genome assembly pipeline specifically designed for SARS-CoV-2 data analysis. This pipeline was created to support laboratories with little experience in bioinformatics or in SARS-CoV-2 analysis. ESCA can be easily installed and runs in most Linux environments.

The ESCA pipeline is a reference-based assembly algorithm written for Linux environments; it requires only raw reads as input files, without any other information. Two versions of the software are available: one for Illumina paired-end reads in "fastq.gz" file format and one for IonTorrent reads in "ubam" file format.

The software is designed to process several samples in a single run. All reads (paired or unpaired) must be copied into the same working directory, and the program is then launched from the command line by typing "StartEasyTorrent" for IonTorrent input or "StartEasyIllumina" for Illumina input. The pipeline then automatically performs all the other steps described below. The program processes all input reads, dividing them into samples using file names as identifiers; Illumina paired-end reads are expected to be split into two files whose names start with the same first 5 letters and contain "R1" or "R2" to distinguish forward reads from reverse reads.
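The file-naming convention for Illumina input can be illustrated with a minimal Python sketch. It is not taken from the ESCA scripts: the function name `group_illumina_pairs` is hypothetical, and searching for the "R1"/"R2" tag only after the 5-character prefix is an assumption made here to avoid matching the sample identifier itself.

```python
from collections import defaultdict
from pathlib import Path


def group_illumina_pairs(filenames, prefix_len=5):
    """Group fastq.gz files into samples keyed by the first `prefix_len` characters.

    Hypothetical helper: it mirrors the naming rule described in the text,
    not the actual ESCA implementation.
    """
    samples = defaultdict(dict)
    for name in sorted(filenames):
        base = Path(name).name
        if not base.endswith(".fastq.gz"):
            continue  # ignore anything that is not a compressed FASTQ file
        sample_id = base[:prefix_len]
        rest = base[prefix_len:]
        # Assumption: the mate tag is looked up after the sample prefix.
        if "R1" in rest:
            samples[sample_id]["R1"] = name
        elif "R2" in rest:
            samples[sample_id]["R2"] = name
    return dict(samples)


if __name__ == "__main__":
    files = ["S0001_R1.fastq.gz", "S0001_R2.fastq.gz", "S0002_R1.fastq.gz"]
    for sample, mates in group_illumina_pairs(files).items():
        print(sample, mates)
```

A sample with only one of the two mates (such as "S0002" in the usage example) is returned with a single entry, so a caller can decide whether to treat it as unpaired or report it as incomplete.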

Sample preprocessing is performed by filtering out all reads with a mean Phred quality score lower than 20 and all reads shorter than 30 nucleotides. Filtered reads are mapped on the SARS-CoV-2 reference genome Wuhan-Hu-1 (GenBank Accession Number NC_045512.2) with the bwa mem software [15]; all reads that do not map on the reference genome are then discarded.
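The read-filtering thresholds can be sketched as follows. This is an illustration under stated assumptions, not the ESCA code: quality strings are assumed to use the Phred+33 encoding, a read is assumed to be discarded if it fails either threshold, and the function names are hypothetical.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def passes_filter(sequence, quality_string, min_mean_q=20, min_len=30):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(sequence) < min_len:
        return False  # shorter than 30 nucleotides
    return mean_phred(quality_string) >= min_mean_q  # mean quality of at least 20


if __name__ == "__main__":
    read = "ACGT" * 10          # 40 nucleotides
    good_quality = "I" * 40     # 'I' encodes Q40 in Phred+33
    poor_quality = "+" * 40     # '+' encodes Q10 in Phred+33
    print(passes_filter(read, good_quality))
    print(passes_filter(read, poor_quality))
```

The length check runs first so that an empty read never reaches the mean-quality division.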

Genome coverage is then analyzed: the read-mapping file is converted into "sorted-bam" and "mpileup" files using the samtools software [16], and these data are translated into a detailed coverage table that reports the count of each nucleotide observed at each position.
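The idea of a per-position coverage table can be shown with a toy Python sketch. It does not parse samtools output: it assumes that each mapped read is already given as a 0-based start position plus its aligned bases, and that alignments contain no insertions or deletions. The function name `coverage_table` is hypothetical.

```python
from collections import Counter


def coverage_table(ref_length, alignments):
    """Count the nucleotides observed at each reference position.

    `alignments` is an iterable of (start, sequence) pairs with 0-based
    starts and gap-free sequences (a simplifying assumption).
    """
    table = [Counter() for _ in range(ref_length)]
    for start, sequence in alignments:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < ref_length:
                table[position][base.upper()] += 1
    return table


if __name__ == "__main__":
    reads = [(0, "ACG"), (1, "CGT"), (2, "TT")]
    for position, counts in enumerate(coverage_table(5, reads)):
        print(position, dict(counts), "depth:", sum(counts.values()))
```

Each row of the table holds the nucleotide counts for one position, and the row total is the depth that a consensus caller can compare against a coverage threshold.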

Consensus sequence is then reconstructed based on three parameters:

(1) The frequency of nucleotides observed at each position.

(2) The nucleotide coverage.

(3) The reference genome sequence.

Briefly, the parameters for consensus sequence reconstruction are designed to call the nucleotide observed with a frequency >50% and with a coverage >50 reads; the minimum coverage is reduced to >10 reads if the most frequent nucleotide observed is identical to the nucleotide at that position in the reference genome. For all positions where the parameters described above are not satisfied, the ESCA pipeline calls "N" to indicate a low-coverage position or an intra-sample nucleotide variant. To test performance with respect to mean coverage, a linear regression was carried out between mean coverage and specific measures of accuracy.
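The consensus rule described above can be written as a short decision function. This is a minimal sketch, not the ESCA source: the function name `call_base` is hypothetical, and "coverage" is assumed to mean the total number of reads covering the position.

```python
def call_base(counts, ref_base, min_freq=0.5, min_cov=50, min_cov_ref=10):
    """Call one consensus base from per-nucleotide read counts.

    counts   : dict mapping nucleotide -> number of reads at this position
    ref_base : nucleotide found at this position in the reference genome
    """
    depth = sum(counts.values())  # assumption: coverage = total depth
    if depth == 0:
        return "N"
    base, support = max(counts.items(), key=lambda item: item[1])
    if support / depth <= min_freq:
        return "N"  # no nucleotide exceeds 50%: possible intra-sample variant
    # A lower coverage threshold applies when the call agrees with the reference.
    required = min_cov_ref if base == ref_base else min_cov
    return base if depth > required else "N"


if __name__ == "__main__":
    print(call_base({"A": 60, "G": 5}, ref_base="G"))   # well-covered mutation
    print(call_base({"A": 20}, ref_base="G"))           # mutation, low coverage
    print(call_base({"G": 20}, ref_base="G"))           # reference base, 20 reads
    print(call_base({"A": 30, "G": 30}, ref_base="G"))  # tie at 50%
```

With these inputs the four calls yield "A", "N", "G" and "N": a non-reference nucleotide needs more than 50 reads, a reference nucleotide only more than 10, and a position with no majority nucleotide is masked.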

In the computational evaluation, the ESCA software was compared with the most widely used assemblers for SARS-CoV-2 genome analysis on 228 SARS-CoV-2-positive samples. To evaluate the ESCA and Dragen/IRMA results, the assembled genomes, the Wuhan-Hu-1 reference and the corrected GISAID genomes (Accession IDs available in Supplementary Table 1) were aligned with MAFFT [19]. At each position along the SARS-CoV-2 genome, the 24 available nucleotide combinations were classified into 11 mutation categories (Fig. 1). The number of occurrences of each mutation category was then evaluated for all sequences and for each assembly software. Nucleotide triplets were classified into the eleven categories below.

The comparison of ESCA vs Dragen showed that, as expected, the mean number of mutations per genome was very low (28 positions on average) and that ESCA correctly identified, on average, 27 of the 28 mutations (Fig. 2a). Moreover, no FN positions were identified for ESCA. The sensitivity of ESCA vs Dragen was 96.43% vs 89.29%, respectively, and the specificity of both methods was 100.00%.
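The reported sensitivity values can be checked against the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below assumes that the per-genome means quoted above (27 of 28 mutations for ESCA) are the counts behind the percentages. The Dragen count of 25 is not stated in the text; it is only the value that reproduces 89.29% under this definition, and the authors' exact counting scheme may differ.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real mutations that were recovered."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-mutated positions that were correctly left unchanged."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # ESCA: 27 of 28 mutations identified on average (from the text).
    print(f"ESCA sensitivity:   {100 * sensitivity(27, 1):.2f}%")
    # Dragen: 25 of 28 is an assumed count, chosen to match the reported 89.29%.
    print(f"Dragen sensitivity: {100 * sensitivity(25, 3):.2f}%")
```

Under these assumptions the two lines print 96.43% and 89.29%, matching the reported figures.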

The importance of rapidly obtaining and sharing high-quality whole genomes of SARS-CoV-2 is increasing with the emergence of variant strains [14]. For this reason, the use of custom amplicon-based sequencing kits can be a rapid and effective method to identify viral variants.

However, a lack of bioinformatics skills can be an obstacle to handling NGS raw data. Our pipeline, ESCA, helps laboratories with low bioinformatics capacity by running the analysis with a single command. Both of the methods most commonly used to analyze IonTorrent and Illumina data, IRMA and Dragen respectively, showed some errors that could lead to false variant assignment. In contrast, the SARS-CoV-2 genomes obtained by ESCA show a reduced number of false insertions and false mutations and a higher number of real mutations.

Moreover, ESCA automatically produces a variant table output file, which is fundamental for rapidly recognizing variants of interest. Together, these results show that ESCA can be a useful method to obtain a rapid, complete and correct analysis, even with minimal bioinformatics skills.

Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health

Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies

Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic

Survey on the Use of Whole-Genome Sequencing for Infectious Diseases Surveillance: Rapid Expansion of European National Capacities

H; FWD-NEXT Expert Panel. PulseNet International: Vision for the implementation of whole genome sequencing (WGS) for global food-borne disease surveillance

Data, disease and diplomacy: GISAID's innovative contribution to global health

Database resources of the National Center for Biotechnology Information

An interactive web-based dashboard to track COVID-19 in real time

Erratum in: Lancet Infect Dis

Nextstrain: real-time tracking of pathogen evolution

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Karthik Gangavarapu, Laura D. Hughes, and the Center for Viral Systems Biology outbreak.info. Available online

Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies

Neutralising antibody escape of SARS-CoV-2 spike protein: Risk assessment for antibody-based Covid-19 therapeutics and vaccines

Epub ahead of print

SARS-CoV-2 genomic sequencing for public health goals: Interim guidance

Fast and accurate short read alignment with Burrows-Wheeler transform

Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

We gratefully acknowledge the contributors of genome sequences of the newly emerging coronavirus, i.e. the Originating Laboratories, for sharing sequences and other metadata through the GISAID Initiative, on which this research is based. We also acknowledge Ornella Butera, Francesco Santini and Giulia Bonfiglio for their contribution to sample preparation.