key: cord-0919719-dhde7pi9
authors: Doddapaneni, Harsha; Cregeen, Sara Javornik; Sucgang, Richard; Meng, Qingchang; Qin, Xiang; Avadhanula, Vasanthi; Chao, Hsu; Menon, Vipin; Nicholson, Erin; Henke, David; Piedra, Felipe-Andres; Rajan, Anubama; Momin, Zeineen; Kottapalli, Kavya; Hoffman, Kristi L.; Sedlazeck, Fritz J.; Metcalf, Ginger; Piedra, Pedro A.; Muzny, Donna M.; Petrosino, Joseph F.; Gibbs, Richard A.
title: Oligonucleotide capture sequencing of the SARS-CoV-2 genome and subgenomic fragments from COVID-19 individuals
date: 2020-07-27
journal: bioRxiv
DOI: 10.1101/2020.07.27.223495
sha: d17e976d7acdfa0c5aee3d1789edc08644f81768
doc_id: 919719
cord_uid: dhde7pi9

The newly emerged and rapidly spreading SARS-CoV-2 causes coronavirus disease 2019 (COVID-19). To facilitate a deeper understanding of the viral biology we developed a capture sequencing methodology to generate SARS-CoV-2 genomic and transcriptome sequences from infected patients. We utilized an oligonucleotide probe-set representing the full-length genome to obtain both genomic and transcriptome (subgenomic open reading frames [ORFs]) sequences from 45 SARS-CoV-2 clinical samples with varying viral titers. For samples with higher viral loads (cycle threshold value under 33, based on the CDC qPCR assay) complete genomes were generated. Analysis of junction reads revealed regions of differential transcriptional activity and provided evidence of expression of ORF10. Heterogeneous allelic frequencies along the 20kb ORF1ab gene suggested the presence of a defective interfering viral RNA species subpopulation in one sample. The associated workflow is straightforward, and hybridization-based capture offers an effective and scalable approach for sequencing SARS-CoV-2 from patient samples.

The COVID-19 pandemic has spread worldwide with alarming speed and has led to the worst healthcare crisis in a century. The agent of COVID-19, the novel SARS-CoV-2 coronavirus (family Coronaviridae), has a ~30 Kb positive-sense single-stranded RNA genome predicted to encode ten open reading frames (ORFs) 1 . Similar to other RNA viruses, coronaviruses undergo mutation and recombination 2,3 that may be critical to understanding physiological responses and disease sequelae, prompting the need for comprehensive characterization of multiple and varied viral isolates.

To date, reports highlighting genomic variation of SARS-CoV-2 have primarily used ampliconbased sequencing approaches (e.g., ARTIC) 4-7 . Attaining uniform target coverage is difficult for amplicon-based methods, and is exacerbated by issues of poor sample quality 8 . Genome variation in the amplicon primer region may also impact sequence assembly. Transcriptome characterization can further contribute to our knowledge of mutation within the SARS-CoV-2 genome, and direct RNA long read sequencing, both alone and in combination with short read sequencing, have been described 1, 9, 10 . Unfortunately, these analyses are equally hampered by sample quality limitations and necessitate use of cultured cell lines.

Oligonucleotide capture ('capture') mitigates these issues as hybridization to specific probes not only enriches for target sequences but enables the analysis of degraded source material [11] [12] [13] [14] .

Capture enrichment has also been applied to viral sequencing, where a panvirome probe design resulted in up to 10,000-fold enrichment of the target sequence and flanking regions 15, 16 .

Hybridization-based enrichment of RNAs can also aid in the identification of gene fusions or splice variants 12, 17, 18 , which are particularly important for coronavirus biology. In addition to encoding a polyprotein that undergoes autocatalyzed hydrolysis, coronaviruses employ subgenomic RNA fragments generated by discontinuous transcription to translate proteins required for viral replication and encapsidation. These subgenomic RNA fragments share a common 62-bp leader sequence derived from the 5' end of the viral genome, detectable as a fused junction to interior ORFs 1,9 . Direct RNA sequencing of cultured cell lines infected with SARS-CoV-2 revealed that the junctional sequences are not evenly distributed between the ORFs, suggesting that individual proteins may be translated at different rates 1 . How virus translation profiles from infected human patients differ from those from cultured cells is as yet unknown.

Here we have utilized capture probes and a streamlined workflow for sequence analysis of both the SARS-CoV-2 genomic sequences and of the junction reads contained within the genomic subfragments generated by discontinuous transcription (Fig.1) . The method can be applied at scale to analyze samples from clinical isolates. Enriching for genomic and transcriptional RNA, followed by deep short-read sequencing, sheds light on variation in clinical SARS-CoV-2 genomic sequences and expression profiles.

A total of 45 samples collected from 32 patients between March 18 and April 25, 2020 in Houston, Tx, USA were analyzed. These were a subset of individuals tested for the presence of SARS-CoV-2 early during the pandemic. RNA fractions were isolated from viral transport media and converted to cDNA. SARS-CoV-2 cDNA libraries were pooled into six groups (Supplementary Table 1 ). All 45 capture-enriched and nine of the pre-capture libraries were sequenced on an Illumina platform based on details provided in the online methods. A schematic workflow is shown in Fig.1 .

A total of 7.15 billion raw reads were generated for the 45 SARS-CoV-2 positive samples sequenced (Supplementary Table 1 ). Sequences were trimmed to filter low quality reads and subsequently mapped to the GRCh38 reference genome to identify human reads ( Fig. 2A) .

Trimmed non-human sequence reads were analyzed using the VirMAP 19 pipeline where between 7-86.4% of total reads from post-capture libraries mapped to the SARS-CoV-2 reference. One sample (192000446B), which had only 6.37 ng total RNA starting material, did not generate any SARS-CoV-2 reads. Overall, the percentage of reads represented by SARS-CoV-2 was higher in samples with CDC protocol-based RT-qPCR Ct values <33 ( Fig. 2A) .

To estimate the capture enrichment efficiency, pre-capture libraries for nine samples, ranging in Ct values of 20.4 to 37.95 (i.e. high to low titer in the original samples), were also sequenced, generating 152.1 -322.9 million reads per sample. Samples 192000106B and 192000090B, with Ct > 37 produced zero reads mapping to the SARS-CoV-2 reference genome. In the remaining seven samples, less than 0.022% of reads were deemed SARS-CoV-2 (Fig. 2B) .

Collectively, post-capture enrichment increased the SARS-CoV-2 mapping rate to 50.9%, a 9,243-fold enrichment.

Spiked synthetic SARS-CoV-2 RNA, encompassing six fragments of 5 Kb each, served as a positive control and were enriched successfully at both 1,500 and 150k copies per sample (Fig.   2C ). In the 1,500 copy libraries (n=2), 3-5% of reads mapped to the SARS-CoV-2 genome, while approximately 65% of reads from the 150k copy libraries (n=2) did the same (Supplementary Table 1 ). This translates to an approximate 91,858-fold enrichment in the 1,500 copy libraries and 13,778-fold enrichment in the 150k copy libraries compared to their starting amounts in the RNA. Three SARS-CoV-2 PCR negative samples were also sequenced, where <0.5% of reads mapped to the SARS-CoV-2 reference genome at 3-5 locations that are not conserved in the SARS-CoV-2 genome (Supplementary Table 1 ; Supplementary Fig. 1 ).

In order to assess the ability of the capture methodology to assemble full-length genomes, both the nine pre-capture and 45 post capture libraries were assembled using both the VirMAP pipeline and the SPAdes de novo assembler 20 .

Full-length SARS-CoV-2 genomes were obtained from 17 of the 45 capture-enriched samples.

Genome coverage in these 17 samples varied from 1071x to 3.19x million (Supplementary Table 1 ). Successful full-length genome assembly was correlated with Ct values below 33 (Fig.   3 ), regardless of the total reads generated during sequencing. Two samples with Ct values above 33, 192000296 (Ct 33.9) and 192000354 (Ct 35.5), obtained from a single patient, also yielded full-length genome reconstructions with acceptable quality (N ≤ 0.5%). Partial genome reconstructions were achieved for the remaining samples although somewhat surprisingly, the correlation between percentage of the genome that was reconstructed and the Ct value of that sample was not tightly correlated when Ct values were above 33 (Fig. 3) . Full-length genome sizes of the 17 capture-enriched and assembled sequences varied from 29.68 Kb to 30.15 Kb ( Supplementary Fig. 2) . Variants relative to the SARS-CoV-2 reference genome sequence NC_045512.2, including single nucleotide polymorphisms and a single indel, ranged from 5 to 15 per sample, with a mean of nine.

Out of the nine pre-capture samples, three (192000072B, 192000021B, 1920000003B) , all with Ct values < 27.4, yielded full-length genomes with 28x -265x genome coverage, while in the other four samples, genome reconstructions were partial and also had a poor genome coverage of 1-6x. SARS-CoV-2 reads were not detected in the two remaining samples. 

To identify and quantitate subgenome-length mRNAs, reads were aligned to the SARS-CoV-2 reference genome NC_045512.2. Only samples with full-length genomes (N=17 capture and N=5 pre-capture) were analyzed for junction reads to avoid introduction of any bias in identifying subgenomic RNA due to gaps in sequence coverage (Fig. 5A and Supplementary Among the five pre-and post-capture comparison pairs, junction reads were identified in more ORFs after capture, and in instances where junction reads were found before and after capture, the expression trend agreed between the two groups.

In the capture libraries, junction reads were identified in all 17 samples in the S gene, followed by ORF8 in 16 samples, ORF3a and ORFa in 14 sample samples, N gene in 13 samples, M gene in 11 and ORF6 in 10 samples with remaining ORFs seen in between 3 to 10 samples.

Junction reads containing canonical leader sequences were not identified in ORF1ab in any sample, suggesting the translation of ORF1ab from genomic RNA is independent of the canonical leader sequence. The average number of junction reads/million was highest for ORF3a (176.3), followed by ORF8 (104.3) and S gene (10.8). The N gene junction reads/million average was skewed due to its high presence in sample 192000052B. Log transformed values are shown in Fig. 5A . For the remaining genes, the average was less than 10 junction reads/million. The expression of ORF10 gene was detected in three of the 17 samples (192000052B, 192000251B, and 192000440B) with expression values of 0.13, 0.13 and 0.02 reads/million ( Supplementary Fig. 4 ). Among the 17 libraries with full-length genomes, there is only one pair 192000296B (Ct 33.8) and 192000354B (Ct 35.5), sampled twice from the same subject (Patient #12) and the junction read expression was lower but detectable in both of these samples (Supplementary Table 3 ).

There were no gaps in the ORF read coverage in any of the 17 capture samples (Fig. 5B ). From 5' to 3' of the genome, there was a gradual increase in the read coverage as expected, for the genomic and subgenomic (transcriptomic) RNA reads. Across the genes in these 17 samples, ORF1ab and ORF3a had the lowest reads per kilobase million (RPKM) values (average 32509 and 27957 RPKM, respectively) while the highest values were seen for ORF10 with a count of 121,643 (Fig. 5B ).

We employed a hybridization-based oligonucleotide capture methodology, combined with short DNA read sequencing, for culture-free genome reconstruction and transcriptome characterization of the SARS-CoV-2 virus. The approach provided complete viral genome sequences and identified sub genomic fragments containing ORFs, shedding light on SARS-CoV-2 transcription in clinical samples. This method uses routine cDNA and library preparation along with Illumina sequencing, employing 96 or more barcodes. Patient samples can be pooled for capture and sequencing, to generate sequence data in large numbers.

The capture method provided considerable enrichment of SARS-CoV-2 in all samples tested.

The enrichment efficiency was calibrated using two spike-in synthetic SARS-CoV-2 RNA controls, and yielded a 91,858-fold enrichment in the 1,500 copy (Ct=36.2), libraries and 13,778-fold enrichment in the 150k copy (Ct=29.6) reconstructed samples. For nine patient samples, where sequence data from pre and post capture libraries were compared, a 9,243-fold enrichment was observed. Some human sequences were observed in the data generated from low viral load samples (CT>33) and these were removed in silico 15 , and did not effectively interfere with the enrichment. Some unevenness in SARS-CoV-2 sequence representation was initially observed when pooling samples within a range of Ct. values. This was managed by pooling groups of samples based upon their range of CT values before capture enrichment.

Full length SARS-CoV-2 genomes were able to be assembled from 17 of the 45 samples analyzed. High quality, full-length reconstructions from capture enrichment appears to be reliably achieved with a viral Ct <33. Between a Ct of 33 and 36, the full-length genome is recovered in some samples while partial genomes, consisting of >50% of the genome length, were reconstructed for the majority (Fig. 3 ). This observation is consistent with the results reported for multiplex amplicon-based approaches by CDC 23 where the multiplex PCR strategy is also effective at generating full length genome sequences <Ct of 33.

Capture enrichment enabled identification of a mixed population of SARS-CoV-2 virus in sample 192000051B, including a putative defective interfering viral RNA species that likely is incapable of translating the viral polyprotein encoded in ORF1ab alongside a replication competent strain.

All heterogeneously called alleles are in ORF1ab, the 20 kb gene encoding the polyprotein essential to the replication of the viral genome. Only one of these alleles (T20520C) is expected to produce a synonymous change in the coding sequence. All the other loci are predicted to change the amino acid sequence of the polyprotein. Most notable is T1783A, which introduced a stop codon early in the translation of ORF1ab. Introduced stop codons are rare among the submitted genome assemblies tracking the evolution of SARS-CoV2 (nextstrain.org), but are distributed all along the genome ( Supplementary Fig. 3 ). In some regions, these introduced stop codon alleles occur in multiple loci along multiple lineages, one of which at a significant enough frequency to be scored with high homoplasy 24 . The low phylogenetic signal disqualifies these loci from much further analysis. A stop codon early in the ORF1ab gene should prevent propagation of the virus, but it can possibly be complemented by the presence of a functional copy of the gene from a co-infecting replication competent virus. Differences in expression were noted among these 17 samples suggesting that ORF expression is patient-specific and interestingly, this patient group expression pattern also differed from the profiles reported by Kim et al. 1 ,. Further, evidence of the expression of ORF10 was supported by multiple junction reads in three of our 17 samples (192000052B, 192000251B, and 192000440B) . The SARS-CoV-2 genome coverage in these three samples was among the highest (3,192,285x, 1,196,745x, and 793,028x) , which might have contributed to their discovery (Supplementary. Table 1 ). ORF10 was also undetected in the other transcriptome study by Taiaroa et al., 2020 using ONT and SARS-CoV-2 infected Vero/hSLAM cells. ORF10 1 is 117 bases in length so it may have been missed by these studies due to its low or absent expression in cultured cells. We note however that the capture methodology is limited in its ability to identify the RNA modifications that were reported by the above two direct RNA-Seq methods.

In summary, this capture enrichment and sequencing method provides an effective approach to generate SARS-CoV-2 genome and transcriptome data directly from clinical samples. Samples with Ct values <33, when sequenced to a depth of approximately 2 million reads (higher than 1000x coverage of the SARS-CoV-2 genome), appear to be sufficient for both full genome reconstruction and identification and quantitation of junction-reads to measure differential ORF expression. This method has a straightforward work-flow and is scalable for sequencing large numbers of patient samples. Pool 5 with 11 samples (Ct 33 -38.9) and Pool 6 with 12 samples (Ct 32.9 -39.5). Pooled cDNA pre-capture libraries were hybridized with probes from the SARS-CoV-2 Panel (Twist, Inc) at 70 o C for 16 hours. Total probe length is 120 Kb. Post-capture insert molecules were further amplified (12-16 cycles) to obtain the final libraries that were sequenced on Illumina NovaSeq S4 flow cell, to generate 2x150 bp paired-end reads. To evaluate the effect of hybridizationbased enrichment 9 samples were sequenced before and after capture enrichment. Ribosomal RNA was removed computationally.

Sequence Mapping, genome reconstruction and variant calling: Raw fastq sequences were processed using BBDuk (sourceforge.net/projects/bbmap/; BBMap version 38.82) to quality trim, remove Illumina adapters and filter PhiX reads. Trimming parameters were set to a k-mer length of 19 and a minimum Phred quality score of 25. Reads with a minimum average Phred quality score below 23 and length shorter than 50 bp after trimming were discarded. The trimmed fastqs were mapped to a combined PhiX (standard Illumina spike in) and human reference genome (GRCh38.p13; GCF_000001405.39) database using a two-step BBTools approach (BBMap version 38.82). Briefly, the trimmed reads were first processed through the bloomfilter script, with a strict k=31 to remove reads identified as human. The remaining reads were mapped to the reference genome with BBMap using a k-mer length of 15, the bloom filter enabled, and fast search settings in order to determine and remove hg38/PhiX reads. Trimmed and human-filtered reads were then processed through VirMAP 19 to obtain full length reconstruction of the SARS-CoV-2 genomes. SPAdes assembler 20 was also used for genome reconstruction. The resulting assemblies were compared to those from VirMAP. A reconstructed genome with >99% the length of the SARS-CoV-2 reference genome, NC_045512.2, was considered a fully reconstructed genome. Downstream analysis of the reconstructed genomes was done using various custom Perl, Bash and R scripts. Plots were generated using R (version 3.6.1) and the tidyverse (version 1.3.0) and ggplot2 (version 3.2.1) packages. Alignments and reference mapping were done using mafft 26 (version 1.4.0) and

BBMap (version 38.82) . Sequence variation compared to SARS-CoV-2 reference genome was performed using the genome alignment from mafft with in-house scripts. For heterozygous variant analysis, the sequence reads were aligned to the reference genome using BWA-mem 27 with default parameters, realigned using GATK 28 , and variants were called using Atlas-SNP2 29 .

Variant annotation was performed with snpEff 30 . Lineage assignment of SARS-COV-2 following Rambaut et al (2020) used the Pangolin COVID-19 Lineage Assigner (https://pangolin.coguk.io).

Illumina sequence reads were aligned to SARS-CoV-2 reference genome NC_045512.2 using STAR aligner v2.7.3a 31 with penalties for non-canonical splicing turned off as described by Kim et al 1 . Alignment bam files were parsed using an in-house script to obtain junction-spanning reads that contained the leader sequence 

Eight of the 17 full-length reconstructed SARS-CoV-2 genomes are available at GISAID ( www.gisaid.org ) under the accession numbers EPI_ISL_444022 and EPI_ISL_445078 -EPI_ISL_445084 and remaining sequences are submitted and under review. T  r  a  c  k  i  n  g  C  h  a  n  g  e  s  i  n  S  A  R  S  -C  o  V  -2  S  p  i  k  e  :  E  v  i  d  e  n  c  e  t  h  a  t  D  6  1  4  G  I  n  c  r  e  a  s  e  s  I  n  f  e  c  t  i  v  i  t  y  o  f  t  h  e  C  O  V  I  D  -1 R  a  p  i  d  ,  s  e  n  s  i  t  i  v  e  ,  f  u  l  l  g  e  n  o  m  e  s  e  q  u  e  n  c  i  n  g  o  f  S  e  v  e  r  e  A  c  u  t  e  R  e  s  p  i  r  a  t  o  r  y  S  y  n  d  r  o  m  e  V  i  r  u  s  C  o  r  o  n  a  v  i  r  u  s  2  (  S  A  R  S  -C  o  V  -2 ) . 

e v o l u t i o n o f S A R S - C o V - 2 . b i o R x i v , 2 0 2 0 . 2 0 0 3 . 2 0 0 5 . 9 7 6 1 6 7 , d o i : 1 0 . 1 1 0 1 / 2 0 2 0 . 0 3 . 0 5 . 9 7 6 1 6 7 ( 2 0 2 0 ) . 1 0 A l e x a n d e r s e n , S . , C h a m i n g s , A . & B h a t t a , T . R . S A R S - C o V - 2 g e n o m i c a n d s u b g e n o m i c R N A s i n d i a g n o s t i c s a m p l e s a r e n o t a n i n d i c a t o r o f a c t i v e r e p l i c a t i o n . m e d R x i v , 2 0 2 0 . 2 0 0 6 . 2 0 0 1 . 2 0 1 1 9 7 5 0 , d o i : 1 0 . 1 1 0 1 / 2 0 2 0 . 0 6 . 0 1 . 2 0 1 1 9 7 5 0 ( 2 0 2 0

Few examples of those junction reads are shown in the figure

Part of this work was supported by the National Institute of Allergy and Infectious Diseases (Grant#1U19AI144297). The authors are grateful to the production teams at HGSC for data generation.

None declared. Table 3 . Junction read counts is reads/million identified in the post capture data of 17 samples with full-length genomes. Table 4 . Junction read counts in reads/million identified in the nine samples sequenced before (IDxxxxB-2) and after capture (IDxxxxB) enrichment.