key: cord-267845-18hb5ndr authors: Resende, Paola Cristina; Motta, Fernando Couto; Roy, Sunando; Appolinario, Luciana; Fabri, Allison; Xavier, Joilson; Harris, Kathryn; Matos, Aline Rocha; Caetano, Braulia; Orgeswalska, Maria; Miranda, Milene; Garcia, Cristiana; Abreu, André; Williams, Rachel; Breuer, Judith; Siqueira, Marilda M title: SARS-CoV-2 genomes recovered by long amplicon tiling multiplex approach using nanopore sequencing and applicable to other sequencing platforms date: 2020-05-01 journal: bioRxiv DOI: 10.1101/2020.04.30.069039 sha: doc_id: 267845 cord_uid: 18hb5ndr
Genomic surveillance has become a useful tool for better understanding virus pathogenicity, origin and spread. Obtaining accurately assembled, complete viral genomes directly from clinical samples is still a challenging task. Here, we describe three protocols using a unique primer set designed to recover long reads of SARS-CoV-2 directly from total RNA extracted from clinical samples. This protocol is useful, accessible and adaptable to laboratories with varying resources and access to distinct sequencing methods: Nanopore, Illumina and/or Sanger.
The novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), belonging to the family Coronaviridae and the genus Betacoronavirus, emerged in Wuhan, China in December 2019 and has been introduced into 185 countries to date (1, 2). In the UK it has caused more than 166,443 reported cases and 26,097 deaths, and in Brazil 80,246 cases and 5,541 deaths have been reported as of April 30th, 2020 (2). On March 11th, the WHO declared a SARS-CoV-2 pandemic, reinforcing the need for all countries to implement measures for rapid detection and characterization of the virus to help mitigate virus transmission. Genomic surveillance has become a useful tool for better understanding virus pathogenicity, origin and spread.
Obtaining accurately assembled, complete viral genomes directly from clinical samples is still a challenging task because the amount of viral nucleic acid in a clinical specimen is low compared to host DNA, and because the SARS-CoV-2 genome is large, around 30 kb in length. Despite those limitations, we developed a sequencing protocol that successfully obtained whole genomes from SARS-CoV-2 positive samples referred to the National Reference laboratory at FIOCRUZ in Brazil. This protocol was further optimised for higher throughput sequencing at University College London Pathogen Genomics Unit and UCL Genomics to sequence genomes for the COVID-19 Genomics UK Consortium (COG-UK).
The tiling amplicon multiplex PCR method has been previously used for virus sequencing directly from clinical samples to obtain consensus genome sequences (3). This protocol has been applied to Ebola, Zika, Chikungunya and SARS-CoV-2 sequencing (3) (4) (5) (6), preferentially using short amplicons (~450 base pairs). Nanopore sequencing allows rapid turnaround times (1-2 days) for obtaining a consensus sequence directly from clinical samples and allows a faster response during an outbreak. In this study, we adapted this protocol to recover longer 2 kb reads, decreasing the number of primers required and thus reducing possible mismatches and/or undesired interactions. Additionally, it is easier to assemble larger viral genomes from longer reads, enabling higher depth of coverage (more than 100x) in a reduced sequencing time.
Here, we describe three protocols using a primer set designed to sequence SARS-CoV-2 directly from total RNA extracted from clinical samples, which were initially diagnosed using real-time RT-PCR (7, 8). The protocols described herein can be applied to different sequencing platforms, such as Sanger, Illumina and Oxford Nanopore, and are therefore useful, accessible and adaptable to laboratories with different resources and sequencing facilities.
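The claim that longer amplicons reduce the number of primers can be illustrated with simple tiling arithmetic. The following is a minimal sketch, not the design procedure used for the primer set: it assumes every amplicon has exactly the same length and that neighbouring amplicons share a fixed overlap. The function name `amplicons_needed` and the 100-base overlap are hypothetical choices made for illustration; the genome length is that of the MN908947 reference record (29,903 nt).

```python
import math


def amplicons_needed(genome_len: int, amplicon_len: int, overlap: int) -> int:
    """Smallest number of equal-length amplicons that tile a genome end to end
    when each consecutive pair of amplicons shares `overlap` bases."""
    if amplicon_len <= overlap:
        raise ValueError("amplicon length must exceed the overlap")
    if genome_len <= amplicon_len:
        return 1
    # n amplicons cover n * (amplicon_len - overlap) + overlap bases.
    step = amplicon_len - overlap
    return math.ceil((genome_len - overlap) / step)


GENOME_LEN = 29_903  # length of the MN908947 reference record
OVERLAP = 100        # hypothetical overlap, chosen only for illustration

for amplicon_len in (450, 2_000):
    n = amplicons_needed(GENOME_LEN, amplicon_len, OVERLAP)
    # Each amplicon needs one forward and one reverse primer.
    print(f"{amplicon_len} bp amplicons: {n} amplicons, {2 * n} primers")
```

Under these idealised assumptions the 2 kb design needs roughly a fifth of the primers of a ~450 bp design. A real scheme will differ somewhat, because amplicon lengths and overlaps vary with where suitable primer sites fall.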
By using this protocol, we generated 18 SARS-CoV-2 genomes (15 from clinical samples and 3 isolates from
A set of 17 primer pairs (Supplementary Table 1 Table 1). The primers were tested in silico using the Geneious R9 software against the 19 SARS-CoV-2 genomes available at that time. To test the efficiency of each primer pair (10 µM) we performed conventional Sanger sequencing with two positive samples detected in
Reverse transcription was initially performed using the SuperScript™ IV First-Strand Synthesis System (Invitrogen), with total RNA from samples presenting Ct values ≤30 for gene E (7, 9). Two multiplexed PCR products (Pool A = 9 primer pairs and Pool B = 8 primer pairs) were generated using the Q5® High-Fidelity DNA Polymerase (NEB) and the primer scheme described in Table 1. The PCR products were purified using Agencourt AMPure XP beads (Beckman Coulter™) and the DNA concentration was measured with the Qubit 4 Fluorometer (Invitrogen) using the Qubit dsDNA HS Assay Kit (Invitrogen). DNA products (multiplex PCR pools A and B) were normalised and pooled together to a final amount of 50 fmol.
The Nanopore library protocol is straightforward, as this method is optimised for long reads such as the 2 kb amplicons generated here. Library preparation was conducted using Ligation Sequencing 1D (SQK-LSK109 Oxford Nanopore Technologies (ONT) and Native Barcoding
As Illumina sequencing chemistry is geared towards sequencing short reads, DNA libraries were generated from the pooled amplicons using the Nextera XT DNA Sample Preparation Kit (Illumina, San Diego, CA, USA) according to the manufacturer's specifications. The size distribution of the libraries was evaluated using a 2100 Bioanalyzer (Agilent, Santa Clara, USA) and the samples were paired-end sequenced (2 x 300 bp) on a MiSeq v3 (600 cycle) (Illumina, San Diego, USA).
Different data analysis pipelines for Illumina and Oxford Nanopore sequencing were used to extract the consensus files from the raw data.
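The normalisation of the multiplex PCR pools to 50 fmol described above rests on a mass-to-molar conversion of the Qubit reading. The sketch below shows that conversion under stated assumptions: an average mass of 660 g/mol per base pair of double-stranded DNA (a common approximation), a nominal amplicon length of 2,000 bp, and a hypothetical Qubit reading of 20 ng/µL. The function names are illustrative and are not part of any kit or instrument software.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (common approximation)


def fmol_to_ng(fmol: float, length_bp: int) -> float:
    """Mass in ng of `fmol` femtomoles of dsDNA fragments of `length_bp` base pairs."""
    # fmol * 1e-15 mol * (length_bp * 660 g/mol) * 1e9 ng/g = fmol * length_bp * 660 * 1e-6
    return fmol * length_bp * AVG_BP_MASS * 1e-6


def ng_to_fmol(ng: float, length_bp: int) -> float:
    """Femtomoles represented by `ng` nanograms of dsDNA fragments of `length_bp` base pairs."""
    return ng * 1e6 / (length_bp * AVG_BP_MASS)


def volume_ul_for(target_fmol: float, length_bp: int, conc_ng_per_ul: float) -> float:
    """Volume in µL to pipette from a stock at `conc_ng_per_ul` to obtain `target_fmol`."""
    return fmol_to_ng(target_fmol, length_bp) / conc_ng_per_ul


TARGET_FMOL = 50      # target amount stated in the text
AMPLICON_BP = 2_000   # nominal amplicon length
QUBIT_READING = 20.0  # ng/µL, hypothetical measurement

mass_ng = fmol_to_ng(TARGET_FMOL, AMPLICON_BP)
volume_ul = volume_ul_for(TARGET_FMOL, AMPLICON_BP, QUBIT_READING)
print(f"{TARGET_FMOL} fmol of {AMPLICON_BP} bp amplicon is about {mass_ng:.1f} ng")
print(f"at {QUBIT_READING} ng/µL this is about {volume_ul:.2f} µL")
```

With these assumptions 50 fmol of a 2 kb amplicon corresponds to roughly 66 ng of DNA. How that amount is split between pools A and B is not specified in the text and is left out of the sketch.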
Demultiplexed FASTQ files generated from the Illumina sequencing data were used as input for the analysis. Reads were trimmed based on quality scores, with a cutoff of Q30 used to remove low quality regions, and adapter sequences were removed. The reads were mapped to Wuhan strain MN908947, duplicate reads were removed from the alignment and the consensus sequence was called at a depth threshold of 10X. The entire workflow was carried out in CLC Genomics Workbench software version 20.0.
For the Oxford Nanopore sequencing data, the high accuracy base called FASTQ files were used as input for analysis. The pipeline used was an adaptation of the artic-ncov2019 medaka workflow (https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html). We used an earlier version of the workflow, which used Porechop to demultiplex the reads. Mapping to the Wuhan reference sequence (MN908947) was done using Minimap2, with Medaka used for error correction. This was all carried out within the artic-ncov2019-medaka conda environment (https://github.com/artic-network/artic-ncov2019).
To put the genomes from Brazil and the UK generated using this protocol in a global context, SARS-CoV-2 genomes from other countries were retrieved from GISAID. Sequences shorter than 29000 nucleotides, sequences flagged with quality issues on GISAID, and sequences in which we detected an unusual frameshifting deletion or insertion relative to the SARS-CoV-2 reference sequence were not included in the phylogenetic reconstruction. To identify genomes similar to those produced here but not yet available in GISAID we used the CoV-GLUE website (http://cov-glue.cvr.gla.ac.uk/#/home), and we used the NextStrain website (https://nextstrain.org/ncov/global) to observe the topology of genomes already available in GISAID. The final curated dataset, consisting of 122 SARS-CoV-2 genome sequences, was aligned using MAFFT (10).
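The length criterion used when curating the GISAID dataset (sequences shorter than 29000 nucleotides excluded) can be sketched as a simple FASTA filter. This is an illustrative sketch only, not the procedure used in the study: the helper names `read_fasta` and `filter_by_length` are hypothetical, length is taken as the raw number of characters in each record, and the other exclusion criteria (quality flags, unusual frameshifting indels) are not implemented.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 29_000  # length threshold stated in the text


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_LENGTH
) -> List[Tuple[str, str]]:
    """Keep only records whose sequence has at least `min_length` characters."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input standing in for a downloaded FASTA file (made-up records).
demo_lines = [
    ">long_genome", "A" * 15_000, "C" * 14_500,
    ">short_fragment", "G" * 1_200,
]
kept = filter_by_length(read_fasta(demo_lines))
print([header for header, _ in kept])
```

For a real file, `read_fasta` can be given an open file handle in place of the toy list, since it only iterates over lines.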
Model testing was carried out using jModelTest (11)
Here we introduce a versatile sequencing protocol to recover the complete SARS-CoV-2 genome, based on reverse transcription plus an overlapping long amplicon multiplex PCR strategy, together with pipelines to report the data and recover the consensus files. The protocol was validated with RNA extracted from some of the first COVID-19 cases detected in Brazil and then optimized and developed for automation at two sequencing facilities at UCL (PGU and UCL Genomics) in London, UK. Alternative protocols for the Illumina platform, based on an initial amplification of larger fragments (8 kb and 10.5 kb) produced by one-step RT-PCR with high fidelity enzyme blends, were also tested. However, they were prone to producing false mutations, likely due to errors during amplification. Because SARS-CoV-2 remains conserved, presenting few mutations scattered throughout the genome, the possibility of artificial mutations must be ruled out. We have demonstrated that this overlapping long amplicon multiplex PCR protocol is suitable for samples with a wide range of viral loads, generating high coverage throughout the viral genome without artificial indels. It worked well on all four platforms tested (MinION, GridION, Illumina and Sanger), making it suitable for labs with distinct expertise and enabling rapid recovery of the SARS-CoV-2 genome sequence directly from clinical samples.
The sequencing workflow optimizations were conducted for the purpose of protocol development, and the samples used for this optimization were collected as part of the National Brazilian Surveillance and COG-UK London. We did not use any clinical information or any patient data in this study.
* Lineage based on Pangolin version 1 subtyping tool (https://github.com/hCoV-2019/pangolin); ** CT = cycle threshold; samples from Brazil and the UK had the Ct value measured by different RT-PCR protocols.
1. Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nat Microbiol
2. An interactive web-based dashboard to track COVID-19 in real time
3. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples
4. Real-time, portable genome sequencing for Ebola surveillance
5. Circulation of chikungunya virus East/Central/South African lineage in Rio de Janeiro
6. nCoV-2019 sequencing protocol
7. Extraction-free COVID-19 (SARS-CoV-2) diagnosis by RT-PCR to increase capacity for national testing programmes during a pandemic
8. Diagnostic detection of 2019-nCoV by real-time RT-PCR
9. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
10. MAFFT multiple sequence alignment software version 7: improvements in performance and usability
11. jModelTest: phylogenetic model averaging
12. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models
We acknowledge the originators of sequences in GISAID (www.gisaid.org); acknowledgments in supplementary Table 1.