key: cord-0919719-dhde7pi9 authors: Doddapaneni, Harsha; Cregeen, Sara Javornik; Sucgang, Richard; Meng, Qingchang; Qin, Xiang; Avadhanula, Vasanthi; Chao, Hsu; Menon, Vipin; Nicholson, Erin; Henke, David; Piedra, Felipe-Andres; Rajan, Anubama; Momin, Zeineen; Kottapalli, Kavya; Hoffman, Kristi L.; Sedlazeck, Fritz J.; Metcalf, Ginger; Piedra, Pedro A.; Muzny, Donna M.; Petrosino, Joseph F.; Gibbs, Richard A. title: Oligonucleotide capture sequencing of the SARS-CoV-2 genome and subgenomic fragments from COVID-19 individuals date: 2020-07-27 journal: bioRxiv DOI: 10.1101/2020.07.27.223495 sha: d17e976d7acdfa0c5aee3d1789edc08644f81768 doc_id: 919719 cord_uid: dhde7pi9 The newly emerged and rapidly spreading SARS-CoV-2 causes coronavirus disease 2019 (COVID-19). To facilitate a deeper understanding of the viral biology we developed a capture sequencing methodology to generate SARS-CoV-2 genomic and transcriptome sequences from infected patients. We utilized an oligonucleotide probe-set representing the full-length genome to obtain both genomic and transcriptome (subgenomic open reading frames [ORFs]) sequences from 45 SARS-CoV-2 clinical samples with varying viral titers. For samples with higher viral loads (cycle threshold value under 33, based on the CDC qPCR assay) complete genomes were generated. Analysis of junction reads revealed regions of differential transcriptional activity and provided evidence of expression of ORF10. Heterogeneous allelic frequencies along the 20kb ORF1ab gene suggested the presence of a defective interfering viral RNA species subpopulation in one sample. The associated workflow is straightforward, and hybridization-based capture offers an effective and scalable approach for sequencing SARS-CoV-2 from patient samples. The COVID-19 pandemic has spread worldwide with alarming speed and has led to the worst healthcare crisis in a century. The agent of COVID-19, the novel SARS-CoV-2 coronavirus (family Coronaviridae), has a ~30 Kb positive-sense single-stranded RNA genome predicted to encode ten open reading frames (ORFs) 1 . Similar to other RNA viruses, coronaviruses undergo mutation and recombination 2,3 that may be critical to understanding physiological responses and disease sequelae, prompting the need for comprehensive characterization of multiple and varied viral isolates. To date, reports highlighting genomic variation of SARS-CoV-2 have primarily used ampliconbased sequencing approaches (e.g., ARTIC) 4-7 . Attaining uniform target coverage is difficult for amplicon-based methods, and is exacerbated by issues of poor sample quality 8 . Genome variation in the amplicon primer region may also impact sequence assembly. Transcriptome characterization can further contribute to our knowledge of mutation within the SARS-CoV-2 genome, and direct RNA long read sequencing, both alone and in combination with short read sequencing, have been described 1, 9, 10 . Unfortunately, these analyses are equally hampered by sample quality limitations and necessitate use of cultured cell lines. Oligonucleotide capture ('capture') mitigates these issues as hybridization to specific probes not only enriches for target sequences but enables the analysis of degraded source material [11] [12] [13] [14] . Capture enrichment has also been applied to viral sequencing, where a panvirome probe design resulted in up to 10,000-fold enrichment of the target sequence and flanking regions 15, 16 . Hybridization-based enrichment of RNAs can also aid in the identification of gene fusions or splice variants 12, 17, 18 , which are particularly important for coronavirus biology. In addition to encoding a polyprotein that undergoes autocatalyzed hydrolysis, coronaviruses employ subgenomic RNA fragments generated by discontinuous transcription to translate proteins required for viral replication and encapsidation. These subgenomic RNA fragments share a common 62-bp leader sequence derived from the 5' end of the viral genome, detectable as a fused junction to interior ORFs 1,9 . Direct RNA sequencing of cultured cell lines infected with SARS-CoV-2 revealed that the junctional sequences are not evenly distributed between the ORFs, suggesting that individual proteins may be translated at different rates 1 . How virus translation profiles from infected human patients differ from those from cultured cells is as yet unknown. Here we have utilized capture probes and a streamlined workflow for sequence analysis of both the SARS-CoV-2 genomic sequences and of the junction reads contained within the genomic subfragments generated by discontinuous transcription (Fig.1) . The method can be applied at scale to analyze samples from clinical isolates. Enriching for genomic and transcriptional RNA, followed by deep short-read sequencing, sheds light on variation in clinical SARS-CoV-2 genomic sequences and expression profiles. A total of 45 samples collected from 32 patients between March 18 and April 25, 2020 in Houston, Tx, USA were analyzed. These were a subset of individuals tested for the presence of SARS-CoV-2 early during the pandemic. RNA fractions were isolated from viral transport media and converted to cDNA. SARS-CoV-2 cDNA libraries were pooled into six groups (Supplementary Table 1 ). All 45 capture-enriched and nine of the pre-capture libraries were sequenced on an Illumina platform based on details provided in the online methods. A schematic workflow is shown in Fig.1 . A total of 7.15 billion raw reads were generated for the 45 SARS-CoV-2 positive samples sequenced (Supplementary Table 1 ). Sequences were trimmed to filter low quality reads and subsequently mapped to the GRCh38 reference genome to identify human reads ( Fig. 2A) . Trimmed non-human sequence reads were analyzed using the VirMAP 19 pipeline where between 7-86.4% of total reads from post-capture libraries mapped to the SARS-CoV-2 reference. One sample (192000446B), which had only 6.37 ng total RNA starting material, did not generate any SARS-CoV-2 reads. Overall, the percentage of reads represented by SARS-CoV-2 was higher in samples with CDC protocol-based RT-qPCR Ct values <33 ( Fig. 2A) . To estimate the capture enrichment efficiency, pre-capture libraries for nine samples, ranging in Ct values of 20.4 to 37.95 (i.e. high to low titer in the original samples), were also sequenced, generating 152.1 -322.9 million reads per sample. Samples 192000106B and 192000090B, with Ct > 37 produced zero reads mapping to the SARS-CoV-2 reference genome. In the remaining seven samples, less than 0.022% of reads were deemed SARS-CoV-2 (Fig. 2B) . Collectively, post-capture enrichment increased the SARS-CoV-2 mapping rate to 50.9%, a 9,243-fold enrichment. Spiked synthetic SARS-CoV-2 RNA, encompassing six fragments of 5 Kb each, served as a positive control and were enriched successfully at both 1,500 and 150k copies per sample (Fig. 2C ). In the 1,500 copy libraries (n=2), 3-5% of reads mapped to the SARS-CoV-2 genome, while approximately 65% of reads from the 150k copy libraries (n=2) did the same (Supplementary Table 1 ). This translates to an approximate 91,858-fold enrichment in the 1,500 copy libraries and 13,778-fold enrichment in the 150k copy libraries compared to their starting amounts in the RNA. Three SARS-CoV-2 PCR negative samples were also sequenced, where <0.5% of reads mapped to the SARS-CoV-2 reference genome at 3-5 locations that are not conserved in the SARS-CoV-2 genome (Supplementary Table 1 ; Supplementary Fig. 1 ). In order to assess the ability of the capture methodology to assemble full-length genomes, both the nine pre-capture and 45 post capture libraries were assembled using both the VirMAP pipeline and the SPAdes de novo assembler 20 . Full-length SARS-CoV-2 genomes were obtained from 17 of the 45 capture-enriched samples. Genome coverage in these 17 samples varied from 1071x to 3.19x million (Supplementary Table 1 ). Successful full-length genome assembly was correlated with Ct values below 33 (Fig. 3 ), regardless of the total reads generated during sequencing. Two samples with Ct values above 33, 192000296 (Ct 33.9) and 192000354 (Ct 35.5), obtained from a single patient, also yielded full-length genome reconstructions with acceptable quality (N ≤ 0.5%). Partial genome reconstructions were achieved for the remaining samples although somewhat surprisingly, the correlation between percentage of the genome that was reconstructed and the Ct value of that sample was not tightly correlated when Ct values were above 33 (Fig. 3) . Full-length genome sizes of the 17 capture-enriched and assembled sequences varied from 29.68 Kb to 30.15 Kb ( Supplementary Fig. 2) . Variants relative to the SARS-CoV-2 reference genome sequence NC_045512.2, including single nucleotide polymorphisms and a single indel, ranged from 5 to 15 per sample, with a mean of nine. Out of the nine pre-capture samples, three (192000072B, 192000021B, 1920000003B) , all with Ct values < 27.4, yielded full-length genomes with 28x -265x genome coverage, while in the other four samples, genome reconstructions were partial and also had a poor genome coverage of 1-6x. SARS-CoV-2 reads were not detected in the two remaining samples. To identify and quantitate subgenome-length mRNAs, reads were aligned to the SARS-CoV-2 reference genome NC_045512.2. Only samples with full-length genomes (N=17 capture and N=5 pre-capture) were analyzed for junction reads to avoid introduction of any bias in identifying subgenomic RNA due to gaps in sequence coverage (Fig. 5A and Supplementary Among the five pre-and post-capture comparison pairs, junction reads were identified in more ORFs after capture, and in instances where junction reads were found before and after capture, the expression trend agreed between the two groups. In the capture libraries, junction reads were identified in all 17 samples in the S gene, followed by ORF8 in 16 samples, ORF3a and ORFa in 14 sample samples, N gene in 13 samples, M gene in 11 and ORF6 in 10 samples with remaining ORFs seen in between 3 to 10 samples. Junction reads containing canonical leader sequences were not identified in ORF1ab in any sample, suggesting the translation of ORF1ab from genomic RNA is independent of the canonical leader sequence. The average number of junction reads/million was highest for ORF3a (176.3), followed by ORF8 (104.3) and S gene (10.8). The N gene junction reads/million average was skewed due to its high presence in sample 192000052B. Log transformed values are shown in Fig. 5A . For the remaining genes, the average was less than 10 junction reads/million. The expression of ORF10 gene was detected in three of the 17 samples (192000052B, 192000251B, and 192000440B) with expression values of 0.13, 0.13 and 0.02 reads/million ( Supplementary Fig. 4 ). Among the 17 libraries with full-length genomes, there is only one pair 192000296B (Ct 33.8) and 192000354B (Ct 35.5), sampled twice from the same subject (Patient #12) and the junction read expression was lower but detectable in both of these samples (Supplementary Table 3 ). There were no gaps in the ORF read coverage in any of the 17 capture samples (Fig. 5B ). From 5' to 3' of the genome, there was a gradual increase in the read coverage as expected, for the genomic and subgenomic (transcriptomic) RNA reads. Across the genes in these 17 samples, ORF1ab and ORF3a had the lowest reads per kilobase million (RPKM) values (average 32509 and 27957 RPKM, respectively) while the highest values were seen for ORF10 with a count of 121,643 (Fig. 5B ). We employed a hybridization-based oligonucleotide capture methodology, combined with short DNA read sequencing, for culture-free genome reconstruction and transcriptome characterization of the SARS-CoV-2 virus. The approach provided complete viral genome sequences and identified sub genomic fragments containing ORFs, shedding light on SARS-CoV-2 transcription in clinical samples. This method uses routine cDNA and library preparation along with Illumina sequencing, employing 96 or more barcodes. Patient samples can be pooled for capture and sequencing, to generate sequence data in large numbers. The capture method provided considerable enrichment of SARS-CoV-2 in all samples tested. The enrichment efficiency was calibrated using two spike-in synthetic SARS-CoV-2 RNA controls, and yielded a 91,858-fold enrichment in the 1,500 copy (Ct=36.2), libraries and 13,778-fold enrichment in the 150k copy (Ct=29.6) reconstructed samples. For nine patient samples, where sequence data from pre and post capture libraries were compared, a 9,243-fold enrichment was observed. Some human sequences were observed in the data generated from low viral load samples (CT>33) and these were removed in silico 15 , and did not effectively interfere with the enrichment. Some unevenness in SARS-CoV-2 sequence representation was initially observed when pooling samples within a range of Ct. values. This was managed by pooling groups of samples based upon their range of CT values before capture enrichment. Full length SARS-CoV-2 genomes were able to be assembled from 17 of the 45 samples analyzed. High quality, full-length reconstructions from capture enrichment appears to be reliably achieved with a viral Ct <33. Between a Ct of 33 and 36, the full-length genome is recovered in some samples while partial genomes, consisting of >50% of the genome length, were reconstructed for the majority (Fig. 3 ). This observation is consistent with the results reported for multiplex amplicon-based approaches by CDC 23 where the multiplex PCR strategy is also effective at generating full length genome sequences 99% the length of the SARS-CoV-2 reference genome, NC_045512.2, was considered a fully reconstructed genome. Downstream analysis of the reconstructed genomes was done using various custom Perl, Bash and R scripts. Plots were generated using R (version 3.6.1) and the tidyverse (version 1.3.0) and ggplot2 (version 3.2.1) packages. Alignments and reference mapping were done using mafft 26 (version 1.4.0) and BBMap (version 38.82) . Sequence variation compared to SARS-CoV-2 reference genome was performed using the genome alignment from mafft with in-house scripts. For heterozygous variant analysis, the sequence reads were aligned to the reference genome using BWA-mem 27 with default parameters, realigned using GATK 28 , and variants were called using Atlas-SNP2 29 . Variant annotation was performed with snpEff 30 . Lineage assignment of SARS-COV-2 following Rambaut et al (2020) used the Pangolin COVID-19 Lineage Assigner (https://pangolin.coguk.io). Illumina sequence reads were aligned to SARS-CoV-2 reference genome NC_045512.2 using STAR aligner v2.7.3a 31 with penalties for non-canonical splicing turned off as described by Kim et al 1 . Alignment bam files were parsed using an in-house script to obtain junction-spanning reads that contained the leader sequence Eight of the 17 full-length reconstructed SARS-CoV-2 genomes are available at GISAID ( www.gisaid.org ) under the accession numbers EPI_ISL_444022 and EPI_ISL_445078 -EPI_ISL_445084 and remaining sequences are submitted and under review. T r a c k i n g C h a n g e s i n S A R S -C o V -2 S p i k e : E v i d e n c e t h a t D 6 1 4 G I n c r e a s e s I n f e c t i v i t y o f t h e C O V I D -1 R a p i d , s e n s i t i v e , f u l l g e n o m e s e q u e n c i n g o f S e v e r e A c u t e R e s p i r a t o r y S y n d r o m e V i r u s C o r o n a v i r u s 2 ( S A R S -C o V -2 ) . e v o l u t i o n o f S A R S - C o V - 2 . b i o R x i v , 2 0 2 0 . 2 0 0 3 . 2 0 0 5 . 9 7 6 1 6 7 , d o i : 1 0 . 1 1 0 1 / 2 0 2 0 . 0 3 . 0 5 . 9 7 6 1 6 7 ( 2 0 2 0 ) . 1 0 A l e x a n d e r s e n , S . , C h a m i n g s , A . & B h a t t a , T . R . S A R S - C o V - 2 g e n o m i c a n d s u b g e n o m i c R N A s i n d i a g n o s t i c s a m p l e s a r e n o t a n i n d i c a t o r o f a c t i v e r e p l i c a t i o n . m e d R x i v , 2 0 2 0 . 2 0 0 6 . 2 0 0 1 . 2 0 1 1 9 7 5 0 , d o i : 1 0 . 1 1 0 1 / 2 0 2 0 . 0 6 . 0 1 . 2 0 1 1 9 7 5 0 ( 2 0 2 0 Few examples of those junction reads are shown in the figure Part of this work was supported by the National Institute of Allergy and Infectious Diseases (Grant#1U19AI144297). The authors are grateful to the production teams at HGSC for data generation. None declared. Table 3 . Junction read counts is reads/million identified in the post capture data of 17 samples with full-length genomes. Table 4 . Junction read counts in reads/million identified in the nine samples sequenced before (IDxxxxB-2) and after capture (IDxxxxB) enrichment.