key: cord-0790392-7o9cowsy
authors: Dong, Xiaofeng; Penrice-Randal, Rebekah; Goldswain, Hannah; Prince, Tessa; Randle, Nadine; Salguero, Javier; Tree, Julia; Vamos, Ecaterina; Nelson, Charlotte; Stewart, James P.; Semple, Malcolm G.; Baillie, J. Kenneth; Openshaw, Peter J. M.; Turtle, Lance; Matthews, David A.; Carroll, Miles W.; Darby, Alistair C.; Hiscox, Julian A.
title: Identification and quantification of SARS-CoV-2 leader subgenomic mRNA gene junctions in nasopharyngeal samples shows phasic transcription in animal models of COVID-19 and dysregulation at later time points that can also be identified in humans
date: 2021-03-05
journal: bioRxiv
DOI: 10.1101/2021.03.03.433753
sha: f0c1da22d532a137bc6f694f742d105b10025c3a
doc_id: 790392
cord_uid: 7o9cowsy

Introduction SARS-CoV-2 has a complex strategy for the transcription of viral subgenomic mRNAs (sgmRNAs), which are targets for nucleic acid diagnostics. Each of these sgmRNAs has a unique 5' sequence, the leader-transcriptional regulatory sequence gene junction (leader-TRS-junction), that can be identified using sequencing. Results High resolution sequencing has been used to investigate the biology of SARS-CoV-2 and the host response in cell culture models and from clinical samples. LeTRS, a bioinformatics tool, was developed to identify leader-TRS-junctions and be used as a proxy to quantify sgmRNAs for understanding virus biology. This was tested on published datasets and clinical samples from patients and longitudinal samples from animal models with COVID-19. Discussion LeTRS identified known leader-TRS-junctions and identified novel species that were common across different species. The data indicated multi-phasic abundance of sgmRNAs in two different animal models, with spikes in sgmRNA abundance reflected in human samples, which has implications for transmission models and nucleic acid-based diagnostics.

Various sequencing approaches are used to characterise SARS-CoV-2 RNA synthesis in cell culture 1, 2 , ex vivo models 3 and clinical samples. This can include nasopharyngeal swabs from patients with COVID-19 4 to post-mortem samples from patients who died of severe disease 5 .

Bioinformatic interrogation of this data can provide critical information on the biology of the virus. SARS-CoV-2 genomes are message sense, and the 5' two thirds of the genome is translated and proteolytically cleaved into a variety of functional subunits, many of which are involved in the synthesis of viral RNA 6 . The remaining one third of the genome is expressed through a nested set of subgenomic mRNAs (sgmRNAs). These have common 5' and 3' ends with the coronavirus genome, including a leader sequence. Many studies have shown that the sgmRNA located towards the 3' end of the genome, which encodes the nucleoprotein, generally has a higher abundance than those located immediately after the 1a/b region and the genome itself 7, 8 . However, there is not necessarily a precise transcription gradient of the sgmRNAs. The 5' leader sequence on the sgmRNAs is immediately abutted to a short sequence called a transcriptional regulatory sequence (TRS) that is involved in the control of sgmRNA synthesis 9, 10 . These TRSs are located along the genome and are proximal to the start codons of the open reading frames 11 . In the negative sense the TRSs are complementary to a short portion of the genomic leader sequence. The TRS is composed of a short core motif that is conserved and flanking sequences 9, 10, 12 . For SARS-CoV-2 the core motif is ACGAAC.

The prevailing thought is that synthesis of sgmRNAs involves a discontinuous step during negative strand synthesis 13, 14 . A natural consequence of this is recombination resulting in insertions and deletions in the viral genome and the formation of defective viral RNAs. Thus, the identification of the leader/sgmRNA complexes by sequencing provides information on the abundance of the sgmRNAs and evidence that transcription has occurred in the tissue being analysed. In terms of clinical samples, if infected cells are present, then leader/sgmRNA 'fusion' sequence can be identified, and inferences made about active viral RNA synthesis from the relative abundance of the sgmRNAs. In the absence of human challenge models, the kinetics of virus infection are unknown, and most studies will begin with detectable viral RNA on presentation of the patient with clinical symptoms. In general, models of infection of humans with SARS-CoV-2 assume an exponential increase in viral RNA synthesis followed by a decrease as antibody levels increase 15 .

In order to investigate the presence of SARS-CoV-2 sgmRNAs in clinical (and other) samples, a bioinformatics tool (LeTRS) was developed to analyse sequencing data from SARS-CoV-2 infections by identifying the unique leader-TRS gene junction site for each sgmRNA. The utility of this tool was demonstrated on cells infected in culture, nasopharyngeal samples from human infections and longitudinal analysis of nasopharyngeal samples from two non-human primate models for COVID-19. The results have implications for diagnostics and disease modelling.

A tool, LeTRS (named after the leader-TRS fusion site), was developed to detect and quantify defined leader gene junctions of SARS-CoV-2 (and other coronaviruses) from multiple types of sequencing data. This was used to investigate SARS-CoV-2 sgmRNA synthesis in humans and non-human primate animal models. LeTRS was developed using the Perl programming language, including a main program for the identification of sgmRNAs and a script for plotting graphs of the results. The tool accepts FastQ files derived from Illumina paired-end or Oxford Nanopore amplicon cDNA, Nanopore direct RNA sequencing, or BAM files produced by a splicing alignment method with a SARS-CoV-2 genome (Supplementary Figure 1). (Note that SARS-CoV-2 sgmRNAs are not formed by splicing, but this is the apparent observation from sequencing data as a result of the discontinuous nature of transcription.) By default, LeTRS analyses SARS-CoV-2 sequence data by using 10 known leader-TRS junctions and an NCBI reference genome (NC_045512.2). However, given the potential heterogeneity in the leader-TRS region and potential novel sgmRNAs, the user can also provide customized leader-TRS junctions and SARS-CoV-2 variants as a reference. The tool was designed to investigate the very large data sets that are produced during sequencing of multiple samples. As there is some heterogeneity in the leader-TRS sites, LeTRS was also designed to search for the leader-TRS junction in a given interval, report the 20 nucleotides at the 3' end of the leader sequence, the TRS and the translated first orf of the sgmRNA, and find the conserved ACGAAC sequences in the TRS.
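The per-junction report described above (20 nucleotides at the 3' end of the leader, presence of the ACGAAC core motif, and the translated first orf) can be illustrated with a minimal Python sketch. This is not the LeTRS implementation, which is written in Perl. The function name `describe_junction`, the `motif_window` parameter, the definition of the TRS window as a fixed number of nucleotides either side of the junction, and the toy genome are all assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

CORE_MOTIF = "ACGAAC"  # SARS-CoV-2 TRS core motif, as stated in the text


def describe_junction(genome, leader_end, trs_start, motif_window=20):
    """Summarise one leader-TRS junction (1-based, inclusive coordinates)."""
    leader = genome[:leader_end]
    body = genome[trs_start - 1:]
    fused = leader + body  # sequence of the putative sgmRNA

    # Assumption: the TRS region is taken as motif_window nt either side of the junction.
    window = fused[max(0, leader_end - motif_window):leader_end + motif_window]

    # Find the first ATG in the fused sequence and translate up to the first stop codon.
    aug_index = fused.find("ATG")
    aug_position, protein = None, ""
    if aug_index != -1:
        for i in range(aug_index, len(fused) - 2, 3):
            aa = CODON_TABLE.get(fused[i:i + 3], "X")  # "X" for codons with non-ACGT bases
            if aa == "*":
                break
            protein += aa
        # Map the index in the fused sequence back to a 1-based genome coordinate.
        if aug_index < leader_end:
            aug_position = aug_index + 1
        else:
            aug_position = trs_start + (aug_index - leader_end)

    return {
        "leader_20": leader[-20:],
        "has_core_motif": CORE_MOTIF in window,
        "aug_position": aug_position,
        "first_orf_aa": protein,
    }


# Toy genome (hypothetical): a 20 nt leader, 10 nt that are skipped, then the sgmRNA body.
toy_genome = "GATCTGTTCTCTAAACGAAC" + "TTTTTTTTTT" + "ACGAACAAACTAAAATGTCTGATAATGGATAAGG"
summary = describe_junction(toy_genome, leader_end=20, trs_start=31)
print(summary)
```

For a real analysis the genome string would be the reference sequence (for example NC_045512.2) and the coordinates would come from the junctions identified in the alignments.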

Combinations of read alignments with the leader-TRS junction that are considered for identifying leader-TRS junction sites

Various approaches have been used to sequence the SARS-CoV-2 genome and in most cases, this would also include any sgmRNAs as they are 3' co-terminal and share common sequence extending from the 3' end. Methods such as ARTIC 16 and RSLA 4 use primer sets to generate overlapping amplicons that span the entire genome, and also amplify sgmRNA. Included is a primer to the leader sequence, so that the unique 5' end of these moieties is also sequenced.

Primer sets of ARTIC and RSLA are generally pooled. Unbiased sequencing can also be used in methodologies to identify SARS-CoV-2 sequence. Data in the GISAID database have been generated by Oxford Nanopore (minority) or Illumina (majority) based approaches. These can give different types of sequencing reads derived from the sgmRNAs that can be mapped back on the reference SARS-CoV-2 genome by splicing alignment (Figure 1A). For example, there are a number of different types of reads that can be derived from mapping Illumina-based amplicon sequencing onto the reference viral genome (Figure 1B and 1C). During the PCR stage, the extension time allows the leader-TRS region on the sgmRNAs to be PCR-amplified by a forward primer located before the leader-TRS junction and a reverse primer located after it, which may come from different primer sets. Both the forward and reverse primers would be detected at the two ends of each read pair (Figure 1B pink lines) if the amplicon had a length shorter than the Illumina read length (usually 100-250 nts). If the amplicon was longer than the Illumina read length, primer sequence would be found at only one end of each read pair (Figure 1B green lines). The extension stage could also proceed with a single primer using cDNA from the sgmRNA as template. This type of PCR has a very low amplification efficiency, but theoretically could also generate the same Illumina paired-end reads with just one primer sequence at one end (Figure 1C). These paired-end reads could include the full length of the leader sequence but might not reach the 3' end of the sgmRNA, because of the limitation of Illumina sequencing length and extension time (Figure 1C). Also, unless there are cryptic TRSs, all sgmRNAs would be expected to be larger than the Illumina sequencing length.

In contrast, the different types of read alignment in the Nanopore based cDNA sequencing are simpler to assign. The longer reads that tend to be generated by Nanopore sequencing (depending on optimisation) enable the capture of full-length sequences of all amplicons.

Provided the leader sequence is included as a forward primer, most of the reads spanning the leader-TRS junction would contain the forward and reverse primer sequences at both ends (Figure 1D pink lines). If the extension time allowed, single primer PCR amplification could take the Nanopore cDNA sequencing reads to both the 3' and 5' ends of the sgmRNAs, and these types of reads would only have a primer sequence at one end (Figure 2D brown lines).

In the Nanopore direct RNA sequencing approach, the full length sgmRNA could be sequenced and mapped entirely on the leader and TRS-orf regions ( Figure 1E ).

In order to assess the ability of LeTRS to identify the leader-TRS junctions from sequencing information, the tool was first evaluated on sequence data obtained from published SARS-CoV-2 infections in cell culture and our laboratory experiments conducted for this study. First, published data was used from sequencing viral RNA at 72 hrs post-infection in a cell culture model 16 . SARS-CoV-2 was sequenced using Nanopore from an amplicon-based approach (ARTIC) 16 (Figure 2A, Table 1 and 2, Supplementary Tables 1 and 2). All of the major known leader-TRS gene junctions were identified. Interestingly, the nucleoprotein gene leader-TRS was approximately the same abundance as the membrane leader-TRS, whereas the other leader-TRSs were much lower. Two potential novel leader-TRS gene junctions were identified at positions 21,055 and 28,249 (Figure 2A, Table 2, Supplementary Table 2). The former is within the orf1b region and the latter within orf8. Second, data was analysed from a published experiment of cells infected (Figure 2B, Table 3 and 4, Supplementary Tables 3 and 4) and a control sample (Tables 5 and 6) in culture using a direct RNA sequencing approach 2 . Analysis demonstrated a more expected pattern of abundance of the leader-TRS gene junctions (Figure 2B, Table 3 and Supplementary Table 3). The leader-TRS nucleoprotein gene junction was most abundant, and in general, the pattern of abundance of the leader-TRS gene junctions for the major structural proteins followed the order of the gene junction along the genome.

Novel low abundance leader-TRS gene junctions were also identified. One of these low abundance leader-TRS gene junctions was also common to those found by the ARTIC amplicon analysis (Figure 2A and B, Table 2 and 4, Supplementary Table 2 and 4). Third, LeTRS was evaluated on sequencing data obtained from VeroE6 cells infected in culture with SARS-CoV-2 (SCV2-006). Here viral RNA, which had been prepared at 24 hrs post-infection, was amplified using the ARTIC approach and sequenced by Illumina (Figure 2C, Table 7 and 8, Supplementary Tables 5 and 6). As would be predicted, the major leader-TRS gene junctions were identified, with the nucleoprotein one being the most abundant. Novel potential leader-gene junctions were also identified, including three that were greater in abundance than the other leader-gene junction. Some of the novel leader-TRS gene junctions from these three cell culture data sets shared the same first orf with the known sgmRNAs (Supplementary Tables 2, 4 and 6).

Periscope v0.08a is another tool that was developed to identify sgmRNA from Illumina and Nanopore ARTIC amplicon sequencing data 17 . The tool functions based on searching a 32 nt leader sequence (genomic position: 34-65) and anchoring the known TRS-orf boundaries on the reads for identification of known sgmRNAs. Periscope does not take into consideration the sequences and distance between the leader and TRS-orf boundaries. Periscope can analyse ARTIC amplicon sequencing data, whereas LeTRS can also accept a variety of different amplicon and direct RNA sequencing data. Given the very large sequencing datasets being generated as part of the global effort to sequence SARS-CoV-2, the performance of LeTRS was compared to Periscope in terms of computation time. Illumina sequencing data from a nasopharyngeal sample of a human patient with COVID-19 and Nanopore ARTIC amplicon published cell culture data sets were used for comparison. This used the number of reads with at least one primer sequence at either end in LeTRS and the number of "High Quality" reads (reads with both the 32 nt leader sequence and a known TRS-orf boundary) in Periscope.

Periscope was run with the default setting 17 . Both LeTRS and Periscope identified a similar number of reads for both data sets (Supplementary Figure 2, Tables 1 and 3 and Supplementary Table 7). With 16 CPU cores, the run times for LeTRS were 1m 40.692s and 2m 14.911s for these tested Illumina and Nanopore data sets, respectively, and for Periscope these were 7m 31.183s and 16m 49.448s. We also tested the data from ARTIC Illumina cell culture data, but Periscope had an error.
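To put the reported run times on a common scale, the short sketch below converts them to seconds and computes the fold difference between the two tools. Only the four run times are taken from the text; the helper `to_seconds` and the dictionary layout are illustrative choices.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Y.YYYs' wall-clock time to seconds."""
    return 60 * minutes + seconds


# Run times reported in the text (16 CPU cores).
run_times = {
    "Illumina": {"LeTRS": to_seconds(1, 40.692), "Periscope": to_seconds(7, 31.183)},
    "Nanopore": {"LeTRS": to_seconds(2, 14.911), "Periscope": to_seconds(16, 49.448)},
}

# Fold difference = Periscope run time / LeTRS run time.
speedups = {
    platform: times["Periscope"] / times["LeTRS"]
    for platform, times in run_times.items()
}

for platform, fold in speedups.items():
    print(f"{platform}: LeTRS was about {fold:.1f}-fold faster than Periscope")
```

On these two data sets the ratio is roughly 4.5-fold for the Illumina data and 7.5-fold for the Nanopore data.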

Analysis of sequencing data from longitudinal nasopharyngeal samples taken from two non-human primate models of COVID-19 indicated multi-phasic sgmRNA synthesis.

Part of the difficulty of studying SARS-CoV-2 and the disease COVID-19 is establishing the sequence of events from the start of infection. Most samples from humans are from nasopharyngeal aspirates taken when clinical symptoms develop. This tends to be 5 to 6 days post-exposure. In the absence of a human challenge model, animal models can be used to study the kinetics of SARS-CoV-2 18, 19 . Two separate non-human primate (NHP) models, cynomolgus and rhesus macaques, were established for the study of SARS-CoV-2 that mirrored disease in the majority of humans 18 . To study the pattern of sgmRNA synthesis over the course of infection, nasopharyngeal samples were gathered daily from one day post-infection up to 18 days post-infection from the two NHP models. RNA was purified from these longitudinal samples as well as from the inoculum virus, and viral RNA was sequenced using the ARTIC approach on the Illumina platform.

Analysis of the sequence data from the inoculum used to infect the NHPs indicated that leader 

The sequencing data spanning cell culture infection, animal models and clinical samples from humans indicated the presence of novel leader-TRS gene junctions. Their detection generally increased with depth of coverage. Coronavirus replication and transcription is promiscuous, and recombination is a natural result of this, resulting in insertions, deletions and potential gene rearrangements. Many of these novel leader-TRS junctions were centred around the known gene open reading frame but out of the search interval. These types of leader-TRS gene junctions could only be found with the spike, membrane, ORF6, ORF7b and nucleocapsid orfs, of which the membrane orf was the most common (Figure 6A). In order to define what might be genuine novel leader-TRS gene junctions, these were compared across all Illumina ARTIC data (Figure 6B, Supplementary Table 11). This identified 5 novel leader-TRS junctions that were common to all the data, the majority of these being focused on the membrane orf.


Coronavirus sgmRNAs are only synthesised during infection of cells and therefore their presence in sequence data can be indicative of active viral RNA synthesis. The abundance of the sgmRNAs in infected cells should follow a general pattern where the sgmRNA encoding the nucleoprotein is the most abundant. Identification and quantification of the unique leader-TRS gene junctions for each sgmRNA can be used as a proxy for their abundance.

LeTRS was developed to interrogate sequencing datasets to identify the leader-gene junctions present at the 5' end of the sgmRNAs. LeTRS was first evaluated and validated on cell culture data from published datasets 2, 16 and from a cell culture experiment as part of this study and then used in an analysis of nasopharyngeal samples from non-human primate and human clinical samples. The results showed that the positions of the leader-TRS junction sites with peak read counts were the same as the given reference positions. The exception was the leader-gene junction for orf7b in the Nanopore sequencing. The normalized counts of reads spanning the junctions confirmed that the leader-TRS nucleoprotein gene junction was the most abundant, and that orf7b and orf10 were the least frequent, in line with other data 2, 20 . Several low abundance leader-TRS junctions were identified in all of the datasets, with the implication that these were either from potential lower abundance novel sgmRNAs, or represented known sgmRNAs but with different leader-TRS junctions. Likewise, at low frequency these could represent an aberrant viral transcription process or artefacts of the different sequencing processes, although this latter possibility is less likely with the published direct RNA sequencing approach 2 (Figure 2B). Traditionally, such sgmRNAs have been first identified in coronaviruses by either northern blot and/or metabolic labelling 8 .

Several other groups have identified novel leader-TRS gene junctions and potential subgenomic mRNAs for other coronaviruses, including avian infectious bronchitis virus 21 . The best way of validating potential novel sgmRNAs would be through matching proteomic data to confirm genuine open reading frames 1 . Analysis of several published sequencing datasets identified novel viral RNA molecules that the authors suggested were sgmRNAs containing only the 5' region of orf1a 22 . Such species are likely to be defective RNAs that act as templates for replication. Interestingly, they hypothesize that at later time points post-infection in cell culture potential novel sgmRNAs are generated non-specifically 22 , which potentially ties in with a disconnect of leader-TRS gene junctions observed in our study, both in vivo in the nasopharyngeal samples from later time points in the NHP models and in humans, and in the published data from SARS-CoV-2 infections in cell culture gathered at later time points compared to earlier time points 2, 16 .

Advanced filtering can improve the confidence of the identified leader-TRS junction in the sequencing reads. Amplicon sequencing provided a unique opportunity to filter the sequencing reads. Reads spanning the junctions that carried the correct forward primer, reverse primer or both primer sequences at their ends provided evidence for the known and novel sgmRNAs in the tested Illumina and Nanopore ARTIC v3 primers amplicon sequencing data (Tables 1 and 3). For Illumina sequencing, the same junction on paired reads with at least one primer provided extra evidence for leader-TRS identification. Some reads were identified that did not have primer sequences, and these were likely to be mis-mapped, derived from template sgmRNA, or low-quality sequence. These were present at very low abundance compared to authentic mapped reads (Tables 1, 3 and 5). No reads with polyA were detected in the Nanopore amplicon sequencing data; this was likely because the limited PCR extension time prevented extension from the primers reaching the 5' end of the subgenomic mRNA (Table 1 and Supplementary Table 5).

The Nanopore direct RNA sequencing had the potential to generate full length mRNA sequences. The polyA sequences and leader-TRS junctions in the reads can be good signals of full length sgmRNA in the test data (Tables 3 and 4). Because the fortuitous sequencing of some host mRNA may lead to false positive results, LeTRS was tested against sequence data from uninfected control cells 2 . No positive reads were found in this control sample (Tables 5 and 6), suggesting that LeTRS could effectively screen out or not recognise any false positives. Crucially, LeTRS used less CPU runtime and provided more detailed information than other tools to investigate this 17 , and therefore is suited for the high throughput analysis of large amounts of diverse sequencing data.

In terms of clinical samples (typically nasopharyngeal swabs), the presence of sgmRNAs will be due to the presence of infected cells. In general, this has been seen as indicative of active viral RNA synthesis at the time of sampling 5, 23, 24 , although these have also been postulated to be present through resistant structures after infection has finished 25 . Analysis of inoculum indicated that leader-TRS gene junctions could be identified (Supplementary Figure 3) but that these were not in the same ratio as found in cells infected in culture (e.g., Figure 2B Likewise, assuming equivalency between the targets, if the nucleoprotein target is found to be more abundant than the spike target, and the spike target more abundant than the genomic target, then this would suggest infected cells are present in the sample. Decreases in Ct values associated with emerging variants could equally be explained by sloughed cells being present in a nasopharyngeal sample as well as by increases in the amount of virions/viral load. Therefore, we would caution that a decrease in Ct associated with RT-qPCR based assays may not just be reflective of higher viral loads but also may be indicative of more infected cells being present. These possibilities may be resolved by considering the relative ratios of sgmRNAs identified.

LeTRS was designed to analyse FastQ files derived from Illumina paired-end or Nanopore sequencing data derived from a SARS-CoV-2 amplicon protocol, or standard Nanopore SARS-CoV-2 direct RNA sequencing data (Figure 1 ). The Illumina/Nanopore FastQ sequencing data were cleaned to remove adapters and low-quality reads before input. Sequencing data derived from other sequencing modes or platforms can also be analysed by LeTRS via input of a BAM file produced by a custom splicing alignment method with a SARS-CoV-2 genome (NC_045512.2) as a reference (Figure 1 ). This can also be rapidly adapted for other coronaviruses.
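Because a splice-aware aligner reports the leader-to-body jump as a skipped region (the `N` operation in a SAM/BAM CIGAR string), the junction coordinates can be recovered from each alignment record. The following Python sketch illustrates that idea only; it is not taken from the LeTRS source, and the function name `junctions_from_cigar` and the example CIGAR strings are hypothetical. The example coordinates are chosen to reproduce the two novel junctions listed in Table 2 (74/21055 and 52/28249).

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def junctions_from_cigar(ref_start, cigar):
    """Return (leader_end, trs_start) pairs, 1-based, for every N gap in one alignment.

    ref_start is the 1-based leftmost mapping position (the SAM POS field).
    """
    junctions = []
    ref_pos = ref_start  # next reference position to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Last aligned base before the gap, first aligned base after it.
            junctions.append((ref_pos - 1, ref_pos + length))
        if op in REF_CONSUMING:
            ref_pos += length
    return junctions


# Hypothetical alignment records: (POS, CIGAR).
alignments = [
    (10, "5S65M20980N50M"),
    (10, "65M20980N80M"),
    (1, "52M28196N40M"),
    (100, "150M"),  # unspliced read: contributes no junction
]

junction_counts = Counter()
for pos, cigar in alignments:
    junction_counts.update(junctions_from_cigar(pos, cigar))

for (leader_end, trs_start), count in junction_counts.most_common():
    print(leader_end, trs_start, count)
```

Counting identical coordinate pairs in this way gives the kind of per-junction read count that the output tables report, before any filtering is applied.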

We sequenced the 15 samples from human patients with Nanopore. Total RNA was isolated using a QIAamp Viral RNA Mini Kit (Qiagen, Manchester, UK) by spin-column procedure according to the manufacturer's instructions. Clinical samples were extracted with Trizol LS as described 4 . All RNA samples were treated with Turbo DNase (Invitrogen). SuperScript IV (Invitrogen) was used to generate single-strand cDNA using random primer mix (NEB, Hitchin, UK). ARTIC V3 PCR amplicons from the single-strand cDNA were generated following the The option "−O 3" was set, so that the 3' end of any read which matched the adapter sequence over more than 3 bp was trimmed off. The reads were further trimmed to remove low quality bases, using Sickle v1.200 27 with a minimum window quality score of 20. After trimming, reads shorter than 10 bp were removed.

LeTRS was also tested with a combined Nanopore cDNA ARTIC v3 amplicon dataset of 7 published viral cell culture samples (barcode01-barcode07) 16 , and a dataset from a published direct RNA Nanopore sequencing analysis of Vero cells infected with SARS-CoV-2 or an uninfected negative control 2 .

LeTRS ran Hisat2 v2.1.0 28 to map the paired-end Illumina reads against the SARS-CoV-2 reference genome (NC_045512.2) with the default settings, and Minimap2 v2.1 29 to align the Nanopore cDNA reads and direct RNA-seq reads to the viral genome with the "-ax splice" and "-ax splice -uf -k14" parameters, respectively. LeTRS provided 10 known leader-TRS junctions to improve alignment accuracy through the "--known-splicesite-infile" option in Hisat2 and the "--junc-bed" option in Minimap2, but users could optionally switch this off. In order to remove low mapping quality and mis-mapped reads before searching the leader-TRS junction sites, LeTRS used Samtools v1.9 30 to apply basic filtering to the reads in the output SAM/BAM files according to their alignment states, as shown (Table 9, basic filtering).
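The aligner settings named above can be assembled into command lines as in the following Python sketch. The Hisat2 and Minimap2 options are those quoted in the text. The file names, the helper function names, and the Samtools flag value are assumptions for illustration: `-F 2304` excludes secondary (256) and supplementary (2048) alignments, which matches the exclusions mentioned in the table footnotes but may not reproduce the full set of criteria in Table 9.

```python
import shlex


def hisat2_command(index_prefix, reads_1, reads_2, out_sam, known_sites=None):
    """Paired-end Illumina mapping with Hisat2 default settings."""
    cmd = ["hisat2", "-x", index_prefix, "-1", reads_1, "-2", reads_2, "-S", out_sam]
    if known_sites:  # optional list of known leader-TRS junctions
        cmd += ["--known-splicesite-infile", known_sites]
    return cmd


def minimap2_command(reference, reads, direct_rna=False, junc_bed=None):
    """Nanopore cDNA ('-ax splice') or direct RNA ('-ax splice -uf -k14') mapping.

    Minimap2 writes SAM to standard output, so the caller redirects it to a file.
    """
    cmd = ["minimap2", "-ax", "splice"]
    if direct_rna:
        cmd += ["-uf", "-k14"]
    if junc_bed:  # optional BED file of known junctions
        cmd += ["--junc-bed", junc_bed]
    return cmd + [reference, reads]


def samtools_basic_filter(in_sam, out_bam, exclude_flags=2304):
    """Assumed basic filter: drop secondary and supplementary alignments."""
    return ["samtools", "view", "-b", "-F", str(exclude_flags), "-o", out_bam, in_sam]


# Hypothetical file names, printed rather than executed.
commands = [
    hisat2_command("NC_045512.2", "R1.fastq", "R2.fastq", "illumina.sam", "known_sites.txt"),
    minimap2_command("NC_045512.2.fasta", "cdna.fastq", junc_bed="known_junctions.bed"),
    minimap2_command("NC_045512.2.fasta", "rna.fastq", direct_rna=True),
    samtools_basic_filter("illumina.sam", "illumina.filtered.bam"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as argument lists keeps them safe to pass to `subprocess.run` without shell quoting issues.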

After the mapping and basic filtering step, LeTRS searched aligned reads spanning the leader- (Table S1-S6) .

Based on the alignment possibilities illustrated in Figure 2 and discussed, LeTRS further filters the identified reads with known and novel leader-TRS junctions. This step is named advanced filtering and can only be applied when the input data is from Illumina paired-end reads, Nanopore cDNA reads or Nanopore RNA reads (Table 9). If a BAM file is used as input data, the advanced filtering step is automatically skipped (Table 9). The number of reads including the known and novel leader-TRS junctions, and the number of reads filtered with the corresponding advanced filtering criteria, were output into two tables in tab format (Tables 1-6).
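The primer-based part of the advanced filtering can be pictured with the sketch below, which tallies junction-spanning reads by whether their ends coincide with a forward (left) primer, a reverse (right) primer, or both, in the same order as the bracketed counts in the amplicon tables. This is a simplified illustration under stated assumptions, not the LeTRS algorithm: primer presence is approximated by aligned read ends falling inside primer intervals, and the primer coordinates, read coordinates and function names are hypothetical.

```python
from collections import Counter


def primer_support(read_start, read_end, forward_primers, reverse_primers):
    """Return (has_left, has_right) for a read aligned to [read_start, read_end], 1-based.

    Assumption: a read carries a primer if its aligned end falls inside a primer interval.
    """
    has_left = any(start <= read_start <= end for start, end in forward_primers)
    has_right = any(start <= read_end <= end for start, end in reverse_primers)
    return has_left, has_right


def tally_primer_support(reads, forward_primers, reverse_primers):
    """Count reads with a left primer, a right primer, and both primers."""
    counts = Counter(total=0, left=0, right=0, both=0)
    for read_start, read_end in reads:
        left, right = primer_support(read_start, read_end, forward_primers, reverse_primers)
        counts["total"] += 1
        counts["left"] += int(left)
        counts["right"] += int(right)
        counts["both"] += int(left and right)
    return counts


# Hypothetical primer intervals and junction-spanning reads (start, end).
forward_primers = [(30, 54)]
reverse_primers = [(21100, 21125)]
reads = [
    (30, 21125),  # both primers
    (30, 21000),  # left primer only
    (60, 21125),  # right primer only
    (60, 21000),  # no primer
]

counts = tally_primer_support(reads, forward_primers, reverse_primers)
print(f"{counts['total']}({counts['left']},{counts['right']},{counts['both']})")
```

Note that, as in the tables, the left and right counts each include the reads that carry both primers.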

LeTRS-plot was developed as an automatic plotting tool that interfaces with the R package ggplot2 v3.3.3 to view the leader-TRS junctions in the tables generated by LeTRS (Figures 3-5).

The plot shows peak count, filtered peak count, normalized peak count and normalized filtered peak count for known leader-TRS junctions, and novel junction counts, filtered novel junction count, normalized novel junction count and filtered normalized novel junction for novel leader-TRS junctions. 

Not applicable

LeTRS is available at https://github.com/xiaofengdong83/LeTRS.

Illumina and nanopore test data sets are available under NCBI PRJNA699398.

The authors declare that they have no competing interests with other authors involved in editing the final version.

Table 1. The LeTRS output table for known sgmRNA in the tested Nanopore ARTIC v3 primers amplicon sequencing data. "ref_leader_end" and "peak_leader_end" point to the reference position of the end of leader and the position of the end of leader identified in the most common reads (peak count) on the reference genome, and "ref_TRS_start" and "peak_TRS_start" refer to the reference position of the start of TRS and the position of the start of TRS identified in the most common reads (peak count) on the reference genome. The numbers in the bracket are (reads with left primers, reads with right primers, reads with both primers, reads with > 1 poly A, reads with > 5 poly A).

Normalized count=(Read count/Total number of read mapped on reference genome)*1000000.

Total number of read mapped on reference genome is 2098410, excluding the mapped reads not primary alignment and supplementary alignment.
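The normalisation used throughout the tables (reads per million mapped reads) can be checked against values that appear in this document: the counts in Table 2 (15, 13 and 12 reads against 2098410 mapped reads) and the Table 4 fragment (11 and 4 reads against 575596 mapped reads). The sketch below is a worked example only; the function name and the choice to return 0.0 when no reads are mapped, as in the negative control, are assumptions.

```python
def normalized_count(read_count, total_mapped):
    """Reads per million mapped reads; returns 0.0 if nothing mapped (assumed convention)."""
    if total_mapped == 0:
        return 0.0
    return read_count / total_mapped * 1_000_000


# Table 2 (Nanopore ARTIC v3 amplicon data): total mapped reads = 2098410.
for reads in (15, 13, 12):
    print(reads, f"{normalized_count(reads, 2_098_410):.2f}")

# Table 4 fragment (Nanopore direct RNA data): total mapped reads = 575596.
for reads in (11, 4):
    print(reads, f"{normalized_count(reads, 575_596):.2f}")
```

Formatted to two decimal places, these reproduce the reported normalized counts of 7.15, 6.20 and 5.72 for Table 2, and 19.11 and 6.95 for the Table 4 fragment.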

Table 2. The LeTRS output table for novel sgmRNA in the tested Nanopore ARTIC v3 primers amplicon sequencing data. "leader_end" and "TRS_start" refer to the position of the end of leader and the position of the start of TRS identified in the reads >10.

subgenome  leader_end  TRS_start  nb_count          normalized_count
1          74          21055      15(13,15,13,0,0)  7.15(6.20,7.15,6.20,0.00,0.00)
2          52          28249      13(13,12,12,0,0)  6.20(6.20,5.72,5.72,0.00,0.00)

The numbers in the bracket are (reads with left primers, reads with right primers, reads with both primers, same junction on paired reads with at least a primer).

Normalized count=(Read count/Total number of read mapped on reference genome)*1000000.

Total number of read mapped on reference genome is 2098410, excluding the mapped reads unpaired, not primary alignment and supplementary alignment.

Table 3. The LeTRS output table for known sgmRNA in the tested Nanopore direct RNA sequencing data. "ref_leader_end" and "peak_leader_end" point to the reference position of the end of leader and the position of the end of leader identified in the most common reads (peak count) on the reference genome, and "ref_TRS_start" and "peak_TRS_start" refer to the reference position of the start of TRS and the position of the start of TRS identified in the most common reads (peak count) on the reference genome. The numbers in the bracket are (reads with > 1 poly A, reads with > 5 poly A).

Normalized count=(Read count/Total number of read mapped on reference genome)*1000000.

Total number of read mapped on reference genome is 575596, excluding mapped reads on reverse strand, not primary alignment and supplementary alignment.

(11,4) 19.11(19.11,6.95)

The numbers in the bracket are (reads with > 1 poly A, reads with > 5 poly A).

Normalized count=(Read count/Total number of read mapped on reference genome)*1000000.

Total number of read mapped on reference genome is 575596, excluding mapped reads on reverse strand, not primary alignment and supplementary alignment.

Table 5. The LeTRS output table for known sgmRNA in the negative control of the tested Nanopore direct RNA sequencing data. "ref_leader_end" and "peak_leader_end" point to the reference position of the end of leader and the position of the end of leader identified in the most common reads (peak count) on the reference genome, and "ref_TRS_start" and "peak_TRS_start" refer to the reference position of the start of TRS and the position of the start of TRS identified in the most common reads (peak count) on the reference genome. The numbers in the bracket are (reads with > 1 poly A, reads with > 5 poly A).

Normalized count=(Read count/Total number of read mapped on reference genome)*1000000.

Total number of read mapped on reference genome is 0, excluding mapped reads on reverse strand, not primary alignment and supplementary alignment.

Table 6. The LeTRS output table for novel sgmRNA in the negative control of the tested Nanopore direct RNA sequencing data. "leader_end" and "TRS_start" refer to the position of the end of leader and the position of the start of TRS identified in the reads >10.

subgenome leader_end TRS_start nb_count normalized_count

The numbers in the bracket are (reads with > 1 poly A, reads with > 5 poly A).

Normalized count=(Read count/Total number of read mapped on reference genome)*1000000.

Total number of read mapped on reference genome is 0, excluding mapped reads on reverse strand, not primary alignment and supplementary alignment.

amplicon sequencing data. "ref_leader_end" and "peak_leader_end" point to the reference position of the end of leader and the position of the end of leader identified in the most common reads (peak count) on the reference genome, and "ref_TRS_start" and "peak_TRS_start" refer to the reference position of the start of TRS and the position of the start of TRS identified in the most common reads (peak count) on the reference genome.

Table 11 ).

Supplementary data

Table S1. The LeTRS output table for details of known sgmRNA in the tested Nanopore ARTIC v3 primers amplicon sequencing data. "peak_leader" and "peak_TRS_start" point to the leader-TRS junctions in Table 1, "ACGAAC" indicates whether there is an ACGAAC sequence in the "TRS_seq" (TRS sequence), "20_leader_seq" refers to the 20 nucleotides before the end of the leader, and "AUG_postion" and "first_orf_aa" refer to the first AUG position and translated ORF of the sgmRNA.

Table S2. The LeTRS output table for details of novel sgmRNA in the tested Nanopore ARTIC v3 primers amplicon sequencing data. "peak_leader" and "peak_TRS_start" point to the leader-TRS junctions in Table 2, "ACGAAC" indicates whether there is an ACGAAC sequence in the "TRS_seq" (TRS sequence), "20_leader_seq" refers to the 20 nucleotides before the end of the leader, "AUG_postion" and "first_orf_aa" refer to the first AUG position and translated ORF of the sgmRNA, and "known_ATG" indicates whether the first AUG position is the same as that of a known sgmRNA.

Table S3. The LeTRS output table for details of known sgmRNA in the tested Nanopore direct RNA sequencing data. "peak_leader" and "peak_TRS_start" point to the leader-TRS junctions in Table 3, "ACGAAC" indicates whether there is an ACGAAC sequence in the "TRS_seq" (TRS sequence), "20_leader_seq" refers to the 20 nucleotides before the end of the leader, and "AUG_postion" and "first_orf_aa" refer to the first AUG position and translated ORF of the sgmRNA.

Table S4. The LeTRS output table for details of novel sgmRNA in the tested Nanopore direct RNA sequencing data. "peak_leader" and "peak_TRS_start" point to the leader-TRS junctions in Table 4, "ACGAAC" indicates whether there is an ACGAAC sequence in the "TRS_seq" (TRS sequence), "20_leader_seq" refers to the 20 nucleotides before the end of the leader, "AUG_postion" and "first_orf_aa" refer to the first AUG position and translated ORF of the sgmRNA, and "known_AUG" indicates whether the first AUG position is the same as that of a known sgmRNA.

primers amplicon sequencing data. "peak_leader" and "peak_TRS_start" point to the leader-TRS junctions in Table 7, "ACGAAC" indicates whether there is an ACGAAC sequence in the "TRS_seq" (TRS sequence), "20_leader_seq" refers to the 20 nucleotides before the end of the leader, and "ATG_postion" and "first_orf_aa" refer to the first AUG position and translated ORF of the sgmRNA.

primers amplicon sequencing data. "peak_leader" and "peak_TRS_start" point to the leader-TRS junctions in Table 8, "ACGAAC" indicates whether there is an ACGAAC sequence in the "TRS_seq" (TRS sequence), "20_leader_seq" refers to the 20 nucleotides before the end of the leader, "AUG_postion" and "first_orf_aa" refer to the first AUG position and translated ORF of the sgmRNA, and "known_AUG" indicates whether the first AUG position is the same as that of a known sgmRNA.
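As an illustration of how the per-junction columns described in these captions could be derived, the sketch below annotates a fused leader-body sequence. It is a simplified illustration under stated assumptions, not the LeTRS implementation: the function name `annotate_junction`, its arguments, the use of transcript (rather than genome) coordinates for the AUG position, and the toy sequences are all hypothetical. The dictionary keys simply mirror the column names quoted in the captions.

```python
# Standard genetic code (NCBI translation table 1), built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def to_dna(seq: str) -> str:
    """Upper-case a sequence and write U as T so RNA and DNA input both work."""
    return seq.upper().replace("U", "T")


def annotate_junction(leader_part: str, body_part: str, trs_seq: str) -> dict:
    """Annotate one leader-TRS junction.

    leader_part: sequence up to the end of the leader (hypothetical input)
    body_part:   sequence from the start of the TRS onwards (hypothetical input)
    trs_seq:     sequence window to search for the ACGAAC core motif
    """
    leader = to_dna(leader_part)
    fused = leader + to_dna(body_part)

    # Locate the first AUG (ATG in DNA letters) in the fused transcript.
    start = fused.find("ATG")

    # Translate codon by codon until a stop codon or the end of the sequence.
    protein = []
    if start != -1:
        for i in range(start, len(fused) - 2, 3):
            aa = CODON_TABLE.get(fused[i:i + 3], "X")  # X for ambiguous codons
            if aa == "*":
                break
            protein.append(aa)

    return {
        "ACGAAC": "ACGAAC" in to_dna(trs_seq),
        "20_leader_seq": leader[-20:],
        # 1-based position in the fused transcript; None if no AUG is found.
        "AUG_postion": start + 1 if start != -1 else None,
        "first_orf_aa": "".join(protein),
    }


if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the study.
    result = annotate_junction(
        leader_part="CCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAAC",
        body_part="AAACTAAAATGTCTGATAATTAAGG",
        trs_seq="TCTAAACGAACAAACT",
    )
    print(result)
```

A "known_AUG"-style column could then be obtained by comparing the reported AUG position with the annotated start codons of the known sgmRNAs, once both are expressed in the same coordinate system.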

Table S7. Leader-TRS gene junctions of reads with at least one primer sequence derived from sequence data from 15 human patients sequenced with the ARTIC pipeline via Illumina.

Table S8. Leader-TRS gene junction count in the infecting SARS-CoV-2 inoculum source used for the NHP study, sequenced by the Illumina ARTIC method.

Table S9. Analysis of leader-TRS gene junction abundance, based on reads with at least one primer sequence at either end, in longitudinal nasopharyngeal samples taken from two non-human primate models (cynomolgus and rhesus macaques) of SARS-CoV-2, in groups. SARS-CoV-2 was amplified using the ARTIC approach and sequenced by Illumina. The data are organised into groups of animals; cynomolgus macaque groups 1 and 2 are labelled with "-1" and "-2" in the Excel sheets.

Table S10. Leader-TRS gene junctions of reads with at least one primer sequence derived from sequence data from 15 human patients sequenced with the ARTIC pipeline via Nanopore.

Characterisation of the transcriptome and proteome of SARS-CoV-2 reveals a cell passage induced in-frame deletion of the furin-like cleavage site from the spike glycoprotein

The Architecture of SARS-CoV-2 Transcriptome

A Comparison of Whole Genome Sequencing of SARS-CoV-2 Using Amplicon-Based Sequencing, Random Hexamers, and Bait Capture

Amplicon-Based Detection and Sequencing of SARS-CoV-2 in Nasopharyngeal Swabs from Patients With COVID-19 and Identification of Deletions in the Viral Genome That Encode Proteins Involved in Interferon Antagonism

Tissue-Specific Immunopathology in Fatal COVID-19

SARS coronavirus replicase proteins in pathogenesis

Genome structure and transcriptional regulation of human coronavirus NL63

Quantification of individual subgenomic mRNA species during replication of the coronavirus transmissible gastroenteritis virus

Investigation of the control of coronavirus subgenomic mRNA transcription by using T7-generated negative-sense RNA transcripts


The members of the ISARIC4C consortia are: Consortium Lead Investigator

Openshaw; ISARIC Clinical Coordinator: Gail Carson

Vanessa Sancho

Data and Information System Managers


The numbers in brackets are (reads with left primers, reads with right primers, reads with both primers, reads with > 1 poly A, reads with > 5 poly A).

The criteria of basic and advanced filtering for four different types of input data for

Supplementary Figure 1. Bioinformatics pipeline for the identification of leader-TRS junctions in sequencing data from SARS-CoV-2 infected material with LeTRS. The pipeline can be rapidly adapted for other coronaviruses. LeTRS can work from Nanopore or Illumina amplicon data, or, by using a BAM file, from more unbiased approaches such as metagenomic or Illumina sequencing.

Supplementary Figure 2. Comparison of LeTRS to Periscope with the Illumina (left) and Nanopore (right) ARTIC amplicon sequencing test data sets, using the number of reads with at least one primer sequence at either end in LeTRS and the number of "High Quality" reads (reads with both the 32 nt leader sequence and a known TRS-ORF boundary) in Periscope.

Raw (A and C) and normalised (B and D) expected (upper) and novel (lower) leader-TRS gene junction counts in the infecting SARS-CoV-2 inoculum source used for the NHP study, sequenced by the Illumina ARTIC method.

Novel leader-TRS gene junctions identified for cynomolgus macaques

Novel leader-TRS gene junctions identified in nasopharyngeal swabs from human patients sequenced using the ARTIC-Illumina approach

Supplementary Figure 6. Novel leader-TRS gene junctions identified in nasopharyngeal swabs from human patients sequenced using the ARTIC-Nanopore approach

We would like to thank all members of the Hiscox Laboratory and the Centre for Genome Research for supporting SARS-CoV-2/COVID-19 sequencing research.