key: cord-0275725-lsqg65ez authors: Pillay, Sureshnee; Giandhari, Jennifer; Tegally, Houriiyah; Wilkinson, Eduan; Chimukangara, Benjamin; Lessells, Richard; Moosa, Yunus; Gazy, Inbal; Fish, Maryam; Singh, Lavanya; Khanyile, Khulekani Sedwell; Fonseca, Vagner; Giovanetti, Marta; Alcantara, Luiz Carols; de Oliveira, Tulio title: Whole Genome Sequencing of SARS-CoV-2: Adapting Illumina Protocols for Quick and Accurate Outbreak Investigation During a Pandemic date: 2020-06-10 journal: bioRxiv DOI: 10.1101/2020.06.10.144212 sha: f51c6dd9c64c64d413c5aacc8a85fe8b23d0adf6 doc_id: 275725 cord_uid: lsqg65ez The COVID-19 pandemic spread very fast around the world. A few days after the first detected case in South Africa, an infection started a large hospital outbreak in Durban, KwaZulu-Natal. Phylogenetic analysis of SARS-CoV-2 genomes can be used to trace the path of transmission within a hospital. It can also identify the source of the outbreak and provide lessons to improve infection prevention and control strategies. In this manuscript, we outline the obstacles we encountered in order to genotype SARS-CoV-2 in real-time during an urgent outbreak investigation. In this process, we encountered problems with the length of the original genotyping protocol, reagent stockout and sample degradation and storage. However, we managed to set up three different library preparation methods for sequencing in Illumina. We also managed to decrease the hands on library preparation time from twelve to three hours, which allowed us to complete the outbreak investigation in just a few weeks. We also fine-tuned a simple bioinformatics workflow for the assembly of high-quality genomes in real-time. In order to allow other laboratories to learn from our experience, we released all of the library preparation and bioinformatics protocols publicly and distributed them to other laboratories of the South African Network for Genomics Surveillance (SANGS) consortium. . Despite widespread attempts to contain the virus in China, within a few months the outbreak had reached and affected 215 countries and territories around the world, including all countries in Africa. Localized outbreaks are common when SARS-CoV-2 is introduced in a new geographic area. These outbreaks require urgent testing and tracing, identification of the causative agent and epidemiological investigations to allow for appropriate infection control and for clusters of patients to be identified [3] . Genomic sequencing can be used to rapidly and accurately identify the transmission routes of the pathogen [4, 5] . This can then be used to trace the path of transmission within a population and to possibly identify the probable source, potentially leading to an improved public health response [6] [7] [8] [9] [10] [11] [12] . On 5 March 2020, the first COVID-19 case was reported in KwaZulu-Natal, South Africa from a traveller from Italy. On 9 March 2020, another returned traveller from Europe started a large chain of infection in a hospital in Durban, KwaZulu-Natal [13] . On 15 of March 2020, South Africa declared a state of emergency and on 27 March a country wide lockdown was implemented, which included grounding most of the international flights. In this study, we outline the processes that we went through to set up genomic sequencing in order to investigate the previously mentioned hospital outbreak in South Africa. The process was more complicated than we expected as, in addition to needing to generate data quickly, we also encountered problems with reagent stockout, importing sequencing reagents and sample degradation and storage. We received 108 COVID-19 positive samples (sampled from the end of March to the beginning of May 2020). Of these, 77 were from nasopharyngeal and oropharyngeal swabs and 31 were extracted RNA. The average age of the patients was 51 years [ranging from 23-91 years], with a gender distribution of 70.5% females and 29.5% males. Of the 108 individuals, 63 were from the hospital outbreak, 30 were from randomly selected positive individuals sampled in the same city but unrelated to the outbreak at the time of sampling. These last samples were used as control for the outbreak investigation. Furthermore, we also sequenced 13 samples delivered from the hospital outbreak but we could not identify the patient or health care worker from the sample ID or the sample ID has been lost in the transport or in storage before coming to our labs. We started sequencing the day after we got involved in the outbreak investigation, on 5 April 2020, so there was no time to prepare. Priority was given to genome sequencing rather than qPCR as there were limited sample volumes and results were required urgently for outbreak investigations. One of the 108 samples failed library preparation; the remaining 107 samples were sequenced and used in the final analysis. It is important to note that these samples did not arrive at the same time. There were 12 shipments and the quality of the samples was not optimal as some were sent to multiple laboratories for diagnostics and were not well stored, i.e. many were stored for days or weeks at room temperature. Furthermore, an undisclosed number of samples were heat inactivated, sometimes more than once. Our laboratory received training in Oxford Nanopore Technologies SARS-CoV-2 sequencing from colleagues from Brazil and our first two sequencing runs involved 24 samples (n=12 each run) that were sequenced on the 7 April 2020. However, our flow cells were over two years old and had less than 600 active pores each, which produced low quality and lower coverage genome (data not shown). We then moved to use the ARTIC protocol [14] on the Illumina Miseq sequencing platform on 8 April 2020. We starting using the ARTIC protocol with no modification, which included the TrueSeq library preparation step. However, this was a very lengthy and laborious process, which took close to twelve hours of hands on time to produce the libraries for sequencing. Furthermore, we had reagents for only 24 samples as our order of Illumina TruSeq DNA Library preparation kit (x 96 sample libraries), which was placed in February and was enough to produce 480 genomes, had not arrived due to the restrictions on international flights. In order to complete the outbreak investigation, we needed to improvise and look for other reagents that could replace this library preparation kit and for reagents that were already available in South Africa. We started by approaching colleagues from six genomics laboratories in South Africa, but unfortunately nobody had a suitable library kit that they could lend us. We also phoned many companies in South Africa to see if they had any local stock. We found two NEB Next Ultra II library kits in stock at a local company, Inqaba Biotech. Inqaba Biotech couriered the kit to us from Johannesburg by car on 10 April. We were happy to have a suitable library kit but quickly realized that one of the components, the NEBNext Multiplex Oligo, which was necessary to run the NEB Next Ultra II library kits, was missing. To overcome this, we changed the protocol to perform adapter ligation using Normally, one produces a qPCR report before attempting sequencing. However, due to the limited RNA and urgent need to generate data to solve the nosocomial outbreak, we generated a qPCR in parallel to the sequencing process and only after we had generated genomes. In summary, the process in the lab involved three days to complete one round of sequencing, qPCR and analysis of the data ( Figure 1 ). The sequencing process involves RNA extraction, RT-PCR, PCR amplification, library preparation, Illumina sequencing, followed by genome assembly and sequence analysis. We started running 12 samples per round, which was increased to 24 per round during the process. In total, we sequenced 108 COVID-19 positive individuals in less than three weeks. The generation of high-quality genomes from the sequencing data was generally performed using a 3-step assembly and clean-up workflow ( Figure 2 ). This process involved a last manual step, when whole-genomes are polished and all of the mutations are visually checked from bam alignments for confirmation. We noticed that some of the adaptors were not well filtered in the assembled process and resulted in mis-called mutations. However, this was easy to spot in the bam files and we checked all the mutations visually. On average, our consensus started with 7 (4 -13) mutations and ended up with 5 (4 -6). The quality of our consensus sequences clearly increased as indicated by significantly higher (t-test) coverage, concordance, identity and matches after Step 2, and significant increases in matches after Step 3 (Supplementary Figure S1 , Table S2 ). This process produced 107 SARS-CoV-2 genomes, 54 of which were whole genomes (>90% coverage) and 53 partial genomes (mean genome length: 60.5%, IQR 51.5% -72.4%). There was a marked increase in coverage seen in samples with a lower Ct score ( Figure 3 ). There was a clear trade-off between Ct score and genome coverage and we chose three cut-off values to analyze the data, Ct <25, <27 and <30 ( Figure 3 ). We consistently show genomes of significantly higher coverage generated from samples with mean Ct scores below the chosen thresholds. For example, we had Ct score values for 51 of the 54 near fulllength genomes with a mean Ct score of 22.2 (19.7 -24.2 For this outbreak, we demonstrate that a Ct < 27 show a good trade-off between high-quality near complete genomes Ct score . However, 12 samples with a Ct < 27 did not produce whole genomes ( Figure 3 ). We associated this to sample degradation as many of the samples have been stored at room temperature. The three different library preparation methods, i.e. TruSeq DNA, NEBNext Ultra II and Nextera DNA Flex, produced similar results. In order to properly evaluate the performance of the library methods, we generated a subset of multiple sequencing for 12 samples, four of which were sequenced in triplicate and four in duplicate for each of the two new library preparation methods (Table 1) . For example, ten of the twelve sequences produced with the different methods provided very similar genome coverage and mutations. In addition, using a Ct Score lower than 30, the TruSeq DNA and NEBNext Ultra II protocols produced 14 and 40 whole-genomes, which represented 66.7% and 45.3% of the genomes produced, respectively. Table S2 ), which caused a non-synonymous mutation at the ORFab1 gene position. More detailed phylogenetic analysis are described in our early genomic epidemiology report of SARS-CoV-2 infection KZN and our investigation of the hospital outbreak [6, 13] . SARS-CoV-2 sequencing is a rapid and accurate method of analysing genetic variability and identifying transmission chains during outbreaks [15] . However, as outbreaks happen sporadically and cannot be predicted, it is not always possible to have all the resources required to perform the necessary tests in resource limited settings. As in many countries, there was a national lockdown in South Africa during the SARS-CoV-2 pandemic, with limited flights in and out of the country, thus making it difficult for suppliers to deliver reagents. In addition, reagents for SARS-CoV-2 were in great demand in the world making them more difficult to access. All samples from the outbreak investigation had initially been collected and tested at different private and government laboratories. Results were required urgently for genomic sequencing and therefore tracing the exact qPCR results from the original labs to determine the cycling threshold (Ct) scores was impractical. Ideally, as shown in other reports of SARS-CoV-2 genomic epidemiology [8] samples should be tested on qPCR first. Those with a Ct below 30 have been shown to produce longer and higher quality genomes. A recent study of high-throughput sequencing in Iceland showed up to 90% success rate of sequencing SARS-CoV-2 near full length genomes for samples with Ct < 30. However, this was a prospective study in a very developed country. We found that in our setting and using samples that had been not optimally collected a Ct score of < 27 provided the best cut-off to produce near full length genomes. In this study, we started using the ARTIC method [14] to amplify the SARS-CoV-2 using a tiling PCR approach. This method recommends the use the TruSeq DNA Nano kit for Illumina sequencing, which is a very labour intensive approach (up to 12h hands on time). As we ran out of reagents for TruSeq and needed to set up other methods quickly, we found a number of much better library sequencing kits that were either cheaper or were much less laborious. We evaluated and used the NEBnext Ultra II (New England BioLabs) and Nextera DNA Flex (Illumina). The NEBnext kit was used to generate the majority of the 108 genomes and 54 near full-length genomes. We also evaluated the Nextera Flex DNA library preparation kit, which saves up to 9h of hands on time, when compared with the original ARTIC protocol that uses TruSeq. All three library preparation methods produced high quality genomes. There was no significant difference seen in coverage. The marginally lower coverage seen with the NEBnext Ultra II kit could be due to the unavailability of the recommended adapters and the substitution with the TruSeq adapters. Slightly lower coverage seen with Nextera Flex DNA could be due to the samples being subjected to freeze thaw cycles as the Nextera Flex DNA kit was used a month after samples were received and tested using the other two kits. Further comparison of the kits looked at the cost and processing times for each of the methods used. The costs of the library preparation methods, including components not included, such as Ampure bead (Beckman Coulter) and End Repair/dA-Tailing Module (New England BioLabs) kits, varied with a difference of up to 30% (Supplementary Table S4 ). The NEBNext Ultra II DNA library method was found to be the cheaper option. Nextera DNA Flex was found to have the shortest processing time of less than three hours. While all three methods required end repair of amplicons prior to indexing, the Nextera Flex method encompassed the tagmentation step together with the ligation of adapters. There was no need to quantify and normalise individual libraries at the end of library preparation as normalization occurred during the tagmentation step. Our study has many limitations. Firstly, we did not have time to prepare properly for the initial sequencing as our access to the first positive samples was during a large nosocomial outbreak investigation. Secondly, the quality of the samples was not homogeneous, as some samples arrived at our laboratories weeks after being sampled from the patients. Thirdly, reagents stockouts were common during the lockdown in South Africa and we had to innovate and adapt the protocols. To summarise, despite the difficulties posed by the lockdown, we were able to complete the data generation and analysis of a large COVID-19 outbreak in South Africa in just a few weeks. We also evaluated the performance of three library preparation kits for their quality, cost, ease of use and time efficiency. In addition, we adapted a bioinformatics workflow to assemble SARS-CoV-2 genomes from raw sequence reads in near-real time. All of our protocols and raw data have been made publicly available and distributed to laboratories of the South African Network for Genomics Surveillance (SANGS) and the Africa Centre for Diseases Control (Africa CDC). We obtained remnant samples collected using nasopharyngeal and oropharyngeal swabs from COVID-19 confirmed patients. These samples consisted of either the primary swab sample or extracted RNA. We heated inactivated swab samples in a water bath at 60°C for 30 minutes, prior to RNA extraction. We extracted RNA on an automated Chemagic 360 instrument using the CMG-1049 kit (Perkin Elmer, Hamburg, Germany), or by manual extraction using the QIAamp Viral RNA Mini Kit (QIAGEN, California, USA). The RNA was stored at -80 C prior to use. We retrieved RNA samples stored at -80 o C and thawed them to room temperature prior to real-time PCR testing. We tested for three SARS-CoV-2 genes, i.e. ORF1ab, S protein and N protein, using the TaqMan 2019-nCoV assay kit v1 according to manufacturer's instructions. In summary, we prepared a 20µl mastermix for each target gene (i.e. ORF1ab, S protein and N protein) containing; 11.25µl of nuclease free water, 6.25µl of TaqMan Fast We performed cDNA synthesis using SuperScript IV reverse transcriptase (Life Technologies) and random hexamer primers, followed by gene specific multiplex PCR using the ARTIC protocol, as described previously [14] . In summary, we attempted SARS-CoV-2 whole genome amplification by multiplex PCR using primers designed on Primal Scheme (http://primal.zibraproject.org/) to generate 400 base pair (bp) amplicons with 70bp overlaps, covering the 30 kilobase SARS-CoV-2 genome. We purified PCR products in a 1:1 ratio using AmpureXP purification beads (Beckman Coulter, High Wycombe, UK), and quantified the purified amplicon using the Qubit dsDNA High Sensitivity assay kit on a Qubit 4.0 instrument (Life Technologies). We estimated amplicon fragment sizes on a LabChip GX Touch (Perkin Elmer, Hamburg, Germany) prior to library preparation. Depending on availability of reagents, we attempted library preparation using three different kits, namely TruSeq DNA Library Prep kit, NEBNEXT Ultra II DNA Library Prep Kit, and the Nextera DNA Flex Library Prep kit. We used the TruSeq Nano DNA library preparation kits ( using a MiSeq Nano Reagent Kit v2 (500 cycles). We were limited from using this method further due to insufficient reagents available for the library preparation. We Raw reads from Illumina sequencing were assembled using Genome Detective 1.126 (https://www.genomedetective.com/) and the coronavirus typing tool [16, 17] . The initial assembly obtained from Genome Detective was polished by aligning mapped reads to the references and filtering out mutations with low genotype likelihoods using bcftools 1.7-2 mpileup method. All mutations were confirmed visually with bam files using Geneious software. For samples with repeat sequencing, forward and reverse reads from all sequencing runs were merged respectively and assembled as one. All of the sequences were deposited in GISAID (https://www.gisaid.org/). Lineage assignments were established using a dynamic lineage classification method proposed by Rambault et al., [18] Step 1: Raw reads from Illumina and Nanopore sequencing were assembled using the web-based Genome Detective 1.126 (https://www.genomedetective.com/) platform and its coronavirus typing tool; Step 2: The initial assembly obtained from Genome Detective was polished by aligning mapped reads to the references and filtering out mutations with low genotype likelihoods using bcftools 1.7-2 mpileup method. This calculation determines the probability of a genotype at sites containing reads with various bases (e.g. the probability that position 27784 is A vs T in illustration above); Step 3: All mutations were validated visually with BAM files viewed in Geneious software to ensure that called mutations were true and not part of lingering adapter sites. A new coronavirus associated with human respiratory disease in China Coronaviruses and SARS-CoV-2: A Brief Overview A complete protocol for whole-genome sequencing of virus from clinical samples: Application to coronavirus OC43 Whole-genome sequencing in outbreak analysis Tracking virus outbreaks in the twentyfirst century Early transmission of SARS-CoV-2 in South Africa: An epidemiological and phylogenetic report Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California Spread of SARS-CoV-2 in the Icelandic Population An emergent clade of SARS-CoV-2 linked to returned travellers from Iran Introductions and early spread of SARS-CoV-2 in the New York City area Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States A territory-wide study of early COVID-19 outbreak in Hong Kong community: A clinical, epidemiological and phylogenomic investigation Augustine's Hospital Forked from Ebola virus sequencing protocol Evaluation of Rapid Library Preparation Protocols for Whole Genome Sequencing Based Outbreak Investigation Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes Genome Detective: an automated system for virus identification from high-throughput sequencing data A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology DNA Sequence Analysis Lectures on Mathematics in the Life Sciences IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies