key: cord-0811851-u1bej79r authors: Vacca, D.; Fiannaca, A.; Tramuto, F.; Cancila, V.; La Paglia, L.; Mazzucco, W.; Gulino, A.; La Rosa, M.; Maida, C. M.; Morello, G.; Belmonte, B.; Casuccio, A.; Maugeri, R.; Iacopino, G.; Vitale, F.; Tripodo, C.; Urso, A. title: Direct RNA nanopore sequencing of SARS-CoV-2 extracted from critical material from swabs date: 2020-12-23 journal: nan DOI: 10.1101/2020.12.21.20191346 sha: f267c7968560e059c8d46ed5987e97434a798924 doc_id: 811851 cord_uid: u1bej79r

ABSTRACT

In consideration of the increasing prevalence of COVID-19 cases in several countries and the resulting demand for unbiased sequencing approaches, we performed a direct RNA sequencing experiment using critical oropharyngeal swab samples collected from Italian patients infected with SARS-CoV-2 from the Palermo region in Sicily. Here, we identified SARS-CoV-2 sequences directly in RNA extracted from critical samples using the Oxford Nanopore MinION technology without prior reverse transcription to cDNA. Using an appropriate bioinformatics pipeline, we could identify mutations in the nucleocapsid (N) gene, which have been reported previously in studies conducted in other countries. To the best of our knowledge, the technique used in this study has not been used for SARS-CoV-2 detection previously, owing to the difficulties in extracting RNA of sufficient quantity and quality from routine oropharyngeal swabs. Despite these limitations, this approach provides the advantages of true native RNA sequencing, and does not include amplification steps that could introduce systematic errors. This study can provide novel information relevant to the current strategies adopted in SARS-CoV-2 next-generation sequencing.
We deposited the gene sequence in the NCBI database under the following URL: https://www.ncbi.nlm.nih.gov/nuccore/MT457389

The study of the SARS-CoV-2 genome has become a priority for global healthcare to facilitate the identification of more suitable vaccines and therapeutic drugs (1). The use of third-generation sequencing technologies has increased significantly in recent years, as these methods yield reliably long reads even from noisy biological samples (2).

Direct RNA sequencing (RNA-seq) of SARS-CoV-2 has the advantage of displaying its native sequence without contamination by artefacts originating from the in vitro cDNA amplification step, which is often necessary for target enrichment (3). However, the quality of RNA extracted from swabs is critical with respect to both fragmentation level and concentration, such that the read coverage of sequencing decreases considerably. In this study, we investigate the suitability of the Oxford Nanopore MinION Mk1B (4), a third-generation nanopore-based platform, for the identification of a critical target of SARS-CoV-2 native RNA from a swab sample, and its sensitivity in the characterization of the viral mutational landscape.

To this end, we analyzed ten RNA samples extracted from oropharyngeal swabs of ten patients with COVID-19, and sequenced them in two pools following the direct RNA sequencing protocol (SQK-RNA002, Oxford Nanopore Technologies). Although the concentration of the library loaded in the sequencing flow cell was approximately 200 times lower than that required for the protocol, we clearly identified two mutations in the nucleocapsid phosphoprotein (N) gene with respect to the sequence of the strain isolated in Wuhan, which has been described in the literature.
After the collection of ten oropharyngeal swab samples from Italian patients with COVID-19, we assayed the concentration and fragmentation level of the extracted RNA to prepare it for sequencing, as recommended by Oxford Nanopore in the Input DNA/RNA quality control guidelines (5). Next, we prepared two sequencing libraries from two distinct pools, as described below, and launched a computational pipeline for [...]

medRxiv preprint doi: https://doi.org/10.1101/2020.12.21.20191346; this version posted December 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The assessment of the R9.4 Flow Cell, which was performed prior to the loading of the libraries, revealed that 1575 active pores were available for sequencing. We performed a new MinKNOW (6) (v19.12.5) experiment using only the A loading mix in the flow cell. In MinKNOW, we selected the SQK-RNA002 kit and continued the sequencing process until approximately 80% of the pores were available (i.e., after 4 h). At this point, we paused the process and loaded the B loading mix in the flow cell. Next, the run was resumed and continued until the number of reads generated remained constant at 397k (i.e., after approximately 27 h). At the end of the sequencing process, we obtained the "Fast5" raw data files.

[...] PycoQC can subsequently provide an overview of the overall quality of the reads. To clean the input data, we filtered the reads by quality and length using the NanoFilt (v2.7.0) tool (9).
This tool can also analyze the sequencing summary file generated by guppy_basecaller to refine the filtering process. For this reason, we filtered the reads using the sequencing summary file under the following parameters: read length ≥ 500 nt and read quality ≥ 8. Thereafter, only the filtered reads were considered in further analysis.

At this point, we set up an alignment step that mapped reads to a reference genome and identified the exact genomic loci corresponding to each read. Given that we did not use primers for the amplification of the SARS-CoV-2 genome, we had to remove all the reads that mapped to other material sequenced from the swabs. In this study, to remove the contaminating sequences, we considered human, fungal, and bacterial reference genomes. Specifically, we used the "Homo sapiens genome assembly GRCh38 (hg38)" from the Genome Reference Consortium (10) as the human reference genome, and all sequences from both fungal and bacterial genome projects from NCBI (11). Lastly, we used the NCBI SARS-CoV-2 sequence NC_045512.2 (12) as the reference genome for SARS-CoV-2.

For each of these reference genomes, we mapped the reads using the minimap2 (v2.17-r941) tool (13), based on earlier reports that demonstrated the effective performance of this tool in the splice-aware alignment of Nanopore direct RNA reads against a reference genome (14). As suggested by the authors who reported the performance of minimap2, we set the parameter k = 13 to increase the sensitivity when mapping noisy Nanopore direct RNA-seq reads.

Next, we extracted both unmapped reads and reads with mapping quality lower than 10 using the "view" utility in the samtools (v1.7) library (15) for each reference genome except for that of SARS-CoV-2.
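The read-selection logic described above can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline used in the study (which relied on NanoFilt, minimap2, and samtools). It assumes a tab-separated sequencing summary with `read_id`, `sequence_length_template`, and `mean_qscore_template` columns, and SAM-formatted alignment records, one collection of lines per contaminant reference. All function names are hypothetical.

```python
import csv

MIN_LENGTH = 500    # nt, threshold stated in the text
MIN_QUALITY = 8.0   # mean read quality, threshold stated in the text
MIN_MAPQ = 10       # alignments below this mapping quality are not trusted


def passing_read_ids(summary_lines):
    """IDs of reads meeting the length and quality thresholds.

    Column names are an assumption about the summary file layout.
    """
    reader = csv.DictReader(summary_lines, delimiter="\t")
    return {
        row["read_id"]
        for row in reader
        if int(row["sequence_length_template"]) >= MIN_LENGTH
        and float(row["mean_qscore_template"]) >= MIN_QUALITY
    }


def non_contaminant_ids(sam_lines):
    """IDs of reads with no confident alignment to a contaminant reference.

    A read is kept when every one of its SAM records is either unmapped
    (FLAG bit 0x4) or has MAPQ below MIN_MAPQ.
    """
    seen, confident = set(), set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        seen.add(name)
        if not (flag & 0x4) and mapq >= MIN_MAPQ:
            confident.add(name)
    return seen - confident


def retained_reads(summary_lines, sam_per_reference):
    """Reads that pass quality control and match none of the contaminants."""
    keep = passing_read_ids(summary_lines)
    for sam_lines in sam_per_reference:
        keep &= non_contaminant_ids(sam_lines)
    return keep
```

Treating a read as a contaminant as soon as any of its records maps confidently is a design choice of this sketch; it avoids keeping a read only because one of its secondary alignments has a low mapping quality.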
As a result, we obtained a subset of the sequenced long reads that did not map to the genomes of human, fungi, or bacteria. [...]

As we stated previously, the overall quality of reads is affected by the adopted sequencing technique, i.e., the direct RNA sequencing of samples derived from swabs. In this experiment, the average read length was lower than that of standard long reads, and the quality of certain reads was below conventional levels. Therefore, we discarded the reads that had low quality (< 8) or a short length (< 500 nt). We obtained a subset of 20,940 good-quality reads.

At this point, we aimed to identify and remove sequences from contaminant species contained in the swab; for this, we used the minimap2 tool (13) to filter out reads that mapped to sequences from human, fungal, and bacterial genomes. Using this method, we filtered out 97.63% of the reads; details of the composition of these reads are provided in Tab. S4. Lastly, we attempted to map the remaining 2.37% of reads to the SARS-CoV-2 reference genome, and observed that 10.89% of these reads (i.e., 54 reads) mapped. These reads did not cover the entire SARS-CoV-2 reference genome; however, they were sufficient for mapping the N gene.

To determine the frequency of the mutation in the samples, we adopted two different approaches: Sanger sequencing and qPCR. Using the former approach, we amplified the region that contained the mutation, as described in the Materials and Methods, prior [...]

In this context, swab RNA, which has a lower concentration and is also more fragmented than in vitro RNA, may not be suitable for performing direct sequencing. [...] SARS-CoV-2 genomes (last download on 12-Jun-2020) from several countries worldwide. We
retrieved this dataset and prepared a script to check for the presence of the identified mutation region (NC_045512:c.28881_28882_28883delinsAAC) in the sequences of the N gene present in CNCB's dataset. As shown in Fig. 4A, we considered three sets: 1) cases that contained only the identified mutations, 2) cases that contained the identified mutations along with others, and 3) cases that did not contain the identified mutations. We found that the same N gene sequence reported by us was reported in approximately 18% of COVID-19 cases, and we took into account the distribution of these cases over time. As shown in Fig. 4B, we reported the presence of the mutation in cases treated per week from February to June, and noticed that the majority of cases were reported between the second week of March and the fourth week of April. This period coincided with that of swab collection from patients.

[...] molecular targets for the identification of specific genomic mutations was confirmed, and could be considered advantageous for both surveillance and epidemiologic studies.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.
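The screen of the CNCB dataset for the N gene mutation, described in the results above, can be sketched in Python. The script used in the study is not shown in the text, so this is only a hypothetical illustration. It assumes that each N gene sequence has already been aligned to the reference N gene without gaps (equal lengths), that the N gene starts at position 28274 of NC_045512.2 (a coordinate taken from the reference annotation, not from the text), and that the three sets in Fig. 4A correspond to "only the identified mutation", "the identified mutation plus other differences", and "mutation absent".

```python
from collections import Counter

N_GENE_START = 28274    # assumption: 1-based start of the N gene in NC_045512.2
MUTATION_START = 28881  # from NC_045512:c.28881_28882_28883delinsAAC
ALT = "AAC"             # substituted trinucleotide


def classify_n_gene(sample, reference,
                    offset=MUTATION_START - N_GENE_START, alt=ALT):
    """Assign an aligned N gene sequence to one of three sets.

    offset is the 0-based index of the first substituted base within the
    N gene; sample and reference must have the same length.
    """
    sample, reference = sample.upper(), reference.upper()
    if len(sample) != len(reference):
        raise ValueError("sample and reference must be aligned to equal length")
    end = offset + len(alt)
    if sample[offset:end] != alt:
        return "absent"
    # Look for any difference from the reference outside the mutated window.
    differs_elsewhere = any(
        s != r
        for i, (s, r) in enumerate(zip(sample, reference))
        if not offset <= i < end
    )
    return "with_others" if differs_elsewhere else "only"


def tally(samples, reference, **kwargs):
    """Count how many sequences fall into each of the three sets."""
    return Counter(classify_n_gene(s, reference, **kwargs) for s in samples)
```

With real data, the fraction of cases carrying the mutation would be the sum of the "only" and "with_others" counts divided by the total; grouping the same labels by collection week would give a distribution over time.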
DV: Data curation, Formal analysis, Methodology, Software, Writing, Editing
AF: Data curation, Formal analysis, Methodology, Software, Writing, Editing
FT: Investigation [...] Resources, Visualization
VC: Investigation, Methodology, Visualization [...] Resources, Visualization
AG: Data curation, Visualization
MLR: Data curation, Methodology, Visualization, Review, Editing
CM: Data curation, Methodology, Resources, Visualization
GM: Data curation [...] Visualization
BB: Data curation, Visualization
AC: Data curation [...] Investigation, Resources, Visualization
RM: Resources, Visualization
GI: Resources [...] Visualization
FV: Conceptualization, Data curation, Investigation, Supervision [...] Review, Editing
CT: Conceptualization, Data curation, Funding acquisition, Investigation, Project administration, Supervision, Visualization, Writing [...] Editing
AU: Conceptualization, Data curation, Investigation, Supervision [...]

Structural Similarity of SARS-CoV2 Mpro and HCV NS3/4A Proteases Suggests New Approaches for Identifying Existing Drugs Useful as COVID-19 Therapeutics
Nanopore sequencing: Review of potential applications in functional genomics
Effect of the enzyme and PCR conditions on the quality of high-throughput DNA sequencing results
Using Direct RNA Nanopore Sequencing to Deconvolute Viral
Oxford Nanopore Technologies, Nanopore Oxford MinION 1kb
pycoQC, interactive quality control for Oxford Nanopore Sequencing. Journal of Open Source Software
Christine Van Broeckhoven, NanoPack: visualizing and processing long-read sequencing data
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly
Accession No. NC_045512.2, Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
Minimap2: pairwise alignment for nucleotide sequences
GSAlign: an efficient sequence alignment tool for intra-species genomes
Genome Project Data Processing Subgroup, The Sequence alignment/map (SAM) format and SAMtools
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
Variant Review with the Integrative Genomics Viewer (IGV)
Real-time detection of BRAF V600E mutation from archival hairy cell leukemia FFPE tissue by nanopore sequencing
Impact of DNA source on genetic variant detection from human whole-genome sequencing data
The impact of DNA input amount and DNA source on the performance of whole-exome sequencing in cancer epidemiology
Evaluation of commercially available RNA amplification kits for RNA sequencing using very low input amounts of total RNA
Using Direct RNA Nanopore Sequencing to Deconvolute Viral
The 2019 novel coronavirus resource

All authors: no reported conflicts.