key: cord-0694741-snh629a5
authors: Sashittal, Palash; Zhang, Chuanyi; Peng, Jian; El-Kebir, Mohammed
title: Jumper Enables Discontinuous Transcript Assembly in Coronaviruses
date: 2021-02-15
journal: bioRxiv
DOI: 10.1101/2021.02.12.431026
sha: e0a5eaaaaea827900bbe1b988bb1219a1a70d345
doc_id: 694741
cord_uid: snh629a5

Genes in SARS-CoV-2 and, more generally, in viruses in the order of Nidovirales are expressed by a process of discontinuous transcription mediated by the viral RNA-dependent RNA polymerase. This process is distinct from alternative splicing in eukaryotes, rendering current transcript assembly methods unsuitable to Nidovirales sequencing samples. Here, we introduce the Discontinuous Transcript Assembly problem of finding transcripts and their abundances c given an alignment under a maximum likelihood model that accounts for varying transcript lengths. Underpinning our approach is the concept of a segment graph, a directed acyclic graph that, distinct from the splice graph used to characterize alternative splicing, has a unique Hamiltonian path. We provide a compact characterization of solutions as subsets of non-overlapping edges in this graph, enabling the formulation of an efficient mixed integer linear program. We show using simulations that our method, Jumper, drastically outperforms existing methods for classical transcript assembly. On short-read data of SARS-CoV-1 and SARS-CoV-2 samples, we find that Jumper not only identifies canonical transcripts that are part of the reference transcriptome, but also predicts expression of non-canonical transcripts that are well supported by direct evidence from long-read data, presence in multiple, independent samples or a conserved core sequence. Jumper enables detailed analyses of Nidovirales transcriptomes. Code availability Software is available at https://github.com/elkebir-group/Jumper

SARS-CoV-2 is part of the taxonomic order of Nidovirales, which comprise enveloped viruses containing a positive-sense, single-stranded RNA genome that encodes for non-structural proteins near the 5' end as well as structural and accessory proteins near the 3' end [1] . Since the ribosome starts at the 5' end, translation of the viral genome only generates the non-structural proteins. Expression of the other genes is achieved by discontinuous transcription performed by the viral RNA-dependent RNA polymerase (RdRp) [2] , which is present in the non-structural part of the viral genome. Specifically, RdRp may 'jump' over contiguous genomic regions, or segments, in the viral RNA template. This process results in discontinuous transcripts that have matching 5' and 3' ends and each correspond to a subsequence of segments ordered as in the reference genome. In a recent paper, Kim et al. [3] analyzed sequencing samples showing that SARS-CoV-2 has both canonical discontinuous transcription events that produce an intact 3' open reading frame (ORF) as well as non-canonical discontinuous transcription events whose role is unclear. To understand the life cycle of Nidovirales, it is critical to assemble the complete set T of expressed transcripts and their abundances c (Fig. 1A) .

There has been a lot of work in methods for transcript assembly from RNA-sequencing data. Briefly, there are two classes of methods, (i) de novo assembly methods and (ii) reference-based methods. The main distinction is that reference-based methods require the reference genome as input while de novo methods have no such requirement. As such, de novo assembly methods [4] [5] [6] [7] [8] are useful when the reference genome is unavailable or when the diversity of different species in the sample is too large. On the other hand, reference-based methods [9] [10] [11] [12] [13] are able to achieve higher accuracy as the reference genome provides a scaffold on which to align sequencing reads. Importantly, current methods from both classes are designed to assemble transcripts in eukaryotes, where transcripts expressed by a gene may differ in their composition due to alternative splicing. Alternative splicing is predominantly mediated by the spliceosome, which results in the removal of introns, defined by conserved splice sites, from the transcribed pre-mRNA. This is a distinct process compared to discontinuous transcription in Nidovirales, which is mediated by viral RdRp, which results in the removal of contiguous segments due to jumps of the RdRp at locations that are not fully characterized. As such, current methods for transcript assembly that have been mainly designed for use in eukaryotes are not optimized for transcript assembly in Nidovirales.

In this study, we introduce the DISCONTINUOUS TRANSCRIPT ASSEMBLY problem of finding discontinuous transcripts T and their abundances c (Fig. 1b) given an alignment R. Underpinning our approach is the concept of a segment graph (Fig. 1c) , a directed acyclic graph that, distinct from the splice graph used where Pr(R | T , c) is the probability of observed reads R from given transcripts T with proportions c.

Using the Bayes theorem and considering that all the reads are independent of one another, we get,

Pr(ri | Tj) Pr(Tj),

where Pr(ri | Tj) is the probability of observing read ri given that it is generated from transcript Tj and Pr(Tj) is the probability of generating a read from transcript Tj. The marginalization over the set of transcripts T is necessary because of the ambiguity of the transcript of origin of any given read.

Make an observation about core sequences, the need for clustering. this does not happen in regular transcript assembly, where exon are non-overlapping in terms of sequence.

What are discontinuous edges? Core sequences -there will be ambiguity unlike in alternative splice where exons are non-overlapping.

Notable differences: matching core sequences leads to set of distinct discontinuous edges that originate from the same transcript. • Panel (f): Split by abundance/expression (low, medium, high). 10 10%

3'

3'

3'

3'

3' 5% 5 ' 3' Figure 1 : (a) Viruses in the order Nidovirales generate a set T of discontinuous transcripts with varying abundances c during infection. (b) Next generation sequencing will produce an alignment R with two types of aligned reads:

unphased reads that map to a contiguous genomic region (black) and phased reads that map to distinct genomic regions (red). (c) From R we obtain the segment graph G, a directed acyclic graph with a unique Hamiltonian path.

JUMPER solves the DISCONTINUOUS TRANSCRIPT ASSEMBLY to infer T and c with maximum likelihood.

to characterize alternative splicing, has a unique Hamiltonian path. We characterize solutions as subsets of non-overlapping edges in this graph, enabling the formulation of an efficient mixed integer linear program while accounting for paired-end read information. Using simulations, we show that our method, JUMPER, drastically outperforms SCALLOP [9] and STRINGTIE [10] , existing methods for transcript assembly. In real data [3] , we run JUMPER on paired-end short-read data of virus infected Vero cells and use long-read data of the same sample for validation. We find that JUMPER not only identifies canonical transcripts that are part of the reference transcriptome, but also predict expression of non-canonical transcripts that are well supported by long-read data. Similarly, JUMPER identifies canonical and non-canonical transcripts in SARS-CoV-1 samples [14] . In summary, JUMPER enables detailed de novo analyses of Nidovirales transcriptomes.

We define discontinuous transcripts as follows. In the literature, discontinuous transcripts that differ from the genomic transcript T 0 are called subgenomic transcripts, which correspond to subgenomic RNAs (sgRNAs) [3] .

While next-generation sequencing technologies provide high coverage of the viral genome of length L of about 10 to 30 Kbp, they are limited to short reads with fixed length ranging from 100 to 400 bp. For ease of exposition, we describe the formulation in context of single-end reads, but in practice we use the paired-end information if it is available. As L, the identity of the transcript of origin for a given read is ambiguous.

Therefore we need to use computational methods to reconstruct the transcripts and their abundances from the sequencing reads. Specifically, given a Nidovirales reference genome of length L and reads of a fixed length , we use a splice-aware aligner such as STAR [15] to obtain an alignment R. This alignment provides information about the abundance c and composition of the underlying transcripts T in the following two ways. First, the depth, or the number of reads along the genome is informative for quantifying the abundance c of the transcripts. Second, the composition T of the transcripts is embedded in phasing reads, which are reads that align to multiple distinct regions in the reference genome (Fig. 1B) .

To make the relationship between T , c and R clear, we introduce the segment graph G, which is obtained from the phasing reads in a alignment R. As mentioned, each phasing read r ∈ R maps to q ≥ 2 distinct regions in the reference genome. 

corresponds to segments that are adjacent in at least one phasing read in R but not adjacent in the reference genome (i.e. w − − v + ≥ 2). Fig. 1c shows an example of a segment graph.

Definition 2. Given an alignment R, the corresponding segment graph G = (V, E → ∪ E ) is a directed graph whose node set V equals the set of segments induced by the junctions of phasing reads in R and whose w − are adjacent.

We note that the segment graph G is closely related to the splice graph used in regular transcript assembly where transcripts correspond to varying sequences of exons due to alternative splicing. The key difference, however, is that an alignment R generated from reads obtained from discontinuous transcripts induces a segment graph G that is a directed acylic graph (DAG) with a unique Hamiltonian path. This is because, as stated in Definition 1, discontinuous transcripts T have matching 5' and 3' ends, and, although their comprising segments may vary, their order follows the reference genome.

Observation 1. Segment graph G is a directed acyclic graph with a unique Hamiltonian path.

The unique Hamiltonian path of G corresponds to the sequence of continuous edges E → . This path corresponds to the whole viral genome which is generated by the RdRp during the replication step [2] .

Moreover, by the above observation, G has a unique source node s and sink node t. Importantly, each transcript T ∈ T that is compatible with an alignment R corresponds to an s − t path π(T ) in G. Here, a path π is a subset of edges E that can be ordered

While splice graphs are DAGs and typically have a unique source and sink node as well, they do not necessarily contain a Hamiltonian path [9, [16] [17] [18] .

Our goal is to find a set T of transcripts and their abundances c that maximize the posterior probability

Under an uninformative, flat prior, this is equivalent to maximizing the probability Pr(R | T , c). We use the segment graph G to compute the probability Pr(R | T , c) of observing an alignment R given transcripts T and abundances c. We follow the generative model which has been extensively used for transcription quantification [19] [20] [21] . The notations used in this paper best resemble the formulation described in [18] .

Let R be composed of reads be {r 1 , . . . , r n } and the set T of transcripts be T = {T 1 , . . . , T k } with lengths L 1 , . . . , L k and abundances c = [c 1 , . . . , c k ]. In line with current literature, reads R are generated independently from transcripts T with abundances c. Further, we must marginalize over the set of transcripts T as the transcript of origin of any given read is typically unknown, since L. Moreover, we assume that the fixed read length is much smaller than the length L i of any transcript T i . As such, we that Pr(R | T , c)

equals

where π(T ) ⊆ E is the s − t path corresponding to transcript T and π(r) ⊆ E is the path induced by the ordered sequence of segments (or nodes of G) spanned by read r. By construction, π(T ) ⊇ π(r) is a necessary condition for transcript T to be a candidate transcript of origin of read r. Appendix A gives the derivation of the above equation (Eq. (1)). Our goal is to find arg max T ,c Pr(R | T , c), leading to the following problem.

Problem 1 (DISCONTINUOUS TRANSCRIPT ASSEMBLY (DTA)). Given alignment R and integer k, find discontinuous transcripts T = {T 1 , . . . , T k } and abundances c = [c 1 , . . . , c k ] such that (i) each transcript T i ∈ T is an s − t path in segment graph G, and (ii) Pr(R | T , c) is maximum.

Eq. (1) defines the probability Pr(R | T , c) in terms of the observed reads r and their induced paths π(r) ⊆ E(G) in the segment graph G. The authors in [18] use this characterization of reads as paths in a general splice graph to account for ambiguity in the transcript of origin for the reads. For a general splice graph, such a characterization is required to capture all the possible observed reads. However, in our setting,

where the segment graph G is a DAG with a unique Hamiltonian path, it is possible to describe each read and each transcript uniquely in a more concise form. Each path in the segment graph is characterized by a set of non-overlapping discontinuous edges. To describe this, we introduce the following definition.

For any transcript T corresponding to an s − t path in G, for which we are only given its discontinuous edges σ(T ), the continuous edges of T are uniquely determined by G and σ(T ). That is, the continuous edges of T equal precisely the subset of continuous edges E → that do not overlap with any of the discontinuous edges in σ(T ). Conversely, given an s−t path π(T ) of G the corresponding set of discontinuous edges is given by σ(T ) = π(T ) ∩ E . Thus, we have the following proposition with the proof in Appendix B.1.

There is a bijection between subsets of discontinuous edges that are pairwise nonoverlapping and s − t paths in G.

In a similar vein, rather than characterizing a read r by its induced path π(r) ⊆ E in the segment graph, we characterize a read r by a pair (σ ⊕ (r), σ (r)) of characteristic discontinuous edges. Here, σ ⊕ (r) is the set of discontinuous edges that must be present in any transcript that could generate read r,

i.e. σ ⊕ (r) = π(r) ∩ E . Conversely, σ (r) is the set of discontinuous edges that must be absent in any transcript that could generate read r due to the unidirectional nature of RdRp transcription. Thus, the set σ (r) consists of discontinuous edges E \ σ ⊕ that overlap with any edge in π(r). Clearly, while Fig. 2b ). Formally, we define (σ ⊕ (r), σ (r)) as follows.

Definition 4. The characteristic discontinuous edges of a read r are a pair (σ ⊕ (r), σ (r)) where σ ⊕ (r) is the set of discontinuous edges present in read r, i.e. σ ⊕ (r) = π(r) ∩ E , and σ i is the set of discontinuous

We have the following result with the proof given in Appendix B.1.

Let G be a segment graph, T be a transcript and r be a read. Then, π(T ) ⊇ π(r) if and only if σ(T ) ⊇ σ ⊕ (r) and σ(T ) ∩ σ (r) = ∅.

Hence, we may rewrite the likelihood Pr(R | T , c) as

where X(T , σ ⊕ j , σ j ) be the subset of indices i corresponding to transcripts T i ∈ T where σ(T i ) ⊇ σ ⊕ j and σ(T i ) ∩ σ j = ∅. Note that the only difference between Eq. (2) and the formulation in Eq. (1) is the way that the candidate transcripts of origin for a given read are described. In Eq. (1), they are described as paths in the splice graph wheres in Eq. (2), they are described by sets of pairwise non-overlapping discontinuous edges in the segment graph. This leads to the following theorem.

Theorem 1. For any alignment R, transcripts T and abundances c, Equations (1) and (2) are identical.

Although we have described the formulation for single-end reads, this characterization is applicable to paired-end and even synthetic long reads. Moreover, our implementation provides support for both singleend and paired-end read samples with a fixed read length. The above characterization using discontinuous edges allows us to reduce the number of terms in the likelihood function since multiple reads can be characterized by the same characteristic discontinuous edges. We describe this in detail in Section 4.

To solve the DTA problem, we use the results of Section 3 to write a more concise form of the likelihood.

Specifically, let S = {(σ ⊕ 1 , σ 1 ), . . . , (σ ⊕ m , σ m )} be the set of characteristic discontinuous edges generated by the reads in alignment R. Let d = {d 1 , · · · , d m } be the number of reads that map to each pair in S. Using that reads r with identical characteristic discontinuous edges (σ ⊕ (r), σ (r)) have identical probabilities Pr(r | T , c), we obtain the following mathematical program for the log-likelihood log Pr(R | T , c) (see

Observe that the first sum (over reads) is concave and the second sum (over transcripts) is convex.

Since we are maximizing, our objective function would ideally be concave. In Appendix B.1, we prove the following lemma, which enables us to remove the second term using a scaling factor for the relative abundances c that does not alter the solution space. is an optimal solution for

We formulate the mathematical program given in Lemma 1 as a mixed integer linear program. More specifically, we encode (i) the composition of each transcript T i as a set σ(T i ) of non-overlapping discontinuous edges, (ii) the abundance c i and length L i of each transcript T i , (iii) the total abundance i∈X(T ,σ ⊕ j ,σ j ) c i of transcripts supported by characteristic discontinuous edges (σ ⊕ j , σ j ), and (iv) a piecewise linear approximation of the log function using a user-specified number h of breakpoints. We will describe (i) and (ii) in the following and refer to Appendix B.2 for (iii) and (iv).

Transcript composition. We begin modeling (8) , which states that each transcript T i must correspond to an s − t path in the segment graph G. Using Proposition 1, we introduce binary variables x ∈ {0, 1} |E |×k 

To model the product c i L i of the length L i of a transcript T i and its abundance c i , we focus on individual discontinuous edges e. For any

Observe that

We introduce continuous variables z e ∈ [0, 1] k and encode the product z e,i = c i x e,i for all e ∈ E as

The The details of the subproblems SOLVEILP and SOLVELP are given in Appendix B.3.

Implementation details. Matching core sequences that mediate the discontinuous transcription by RdRp lead to ambiguity in precise location of breakpoint during alignment of spliced reads. Therefore, in practice we observe multiple discontinuous edges with closely spaced 5' and 3' breakpoints. Moreover, false 

Before we describe our results for coronavirus samples, we establish some terminology that will be used in assembled by JUMPER.

We begin our simulations with a segment graph G obtained from a short-read sample (SRR11409417).

Following Kim et al. [3] , we used fastp to trim short reads (trimming parameter set to 10 nucleotides), which were input to STAR run in two-pass mode yielding an alignment R. Fig. 3a shows the sashimi plot of the canonical and the non-canonical discontinuous edges (mappings) supported by the reads in the sample. From R, we obtained G by only including discontinuous edges supported by at least 20 reads.

The segment graph for this sample has |V | = 39 nodes and |E| = 67 edges, which include |E | = 29 discontinuous edges and |E → | = 38 continuous edges. The former are subdivided into 14 canonical discontinuous edges that produce a known ORF and 15 non-canonical discontinuous edges. The next step of our simulation pipeline is to generate transcripts T and their abundances c for the given segment graph. We used the negative-sense discontinuous transcription model (described in Appendix C.1). Upon generating the transcripts, we simulated the generation and sequencing of RNA-seq data, and aligned the simulated reads using STAR [15] . We generated 5 independent pairs (T , c) of transcripts and abundances under the negative-sense discontinuous transcription model. Fig. 3b shows the number of transcripts generated from each simulation using the negative-sense discontinuous transcription model. For each pair (T , c) we ran 5

paired-end short read sequencing simulations using polyester [24] . Therefore, we generated a total of 5 × 5 = 25 simulated instances. Details are provided in Appendix C.1. Note that our method JUMPER does not use the negative-sense discontinuous transcription model to infer the viral transcripts from the simulated data.

We compare the performance of our method JUMPER with two other reference-based transcript assembly methods, SCALLOP and STRINGTIE. All the methods were run with their default parameters. To avoid including false-positive discontinuous edges, JUMPER discards discontinuous edges with support less than shows that JUMPER outperforms SCALLOP and STRINGTIE for all values of Λ, although it incurs significantly more runtime for Λ = 10. Thus, we find that JUMPER consistently performs better than SCALLOP and STRINGTIE.

To better understand the tradeoff between precision and recall, we zoom in one five distinct pairs (T , c) 

Transcripts T Abundances c canonical non-canonical of simulation instance. Fig. 3d shows the precision and recall values achieved by each method for each of these five simulation instances, demonstrating that JUMPER consistently outperforms both SCALLOP and STRINGTIE. On average, JUMPER recalls 5 times more transcripts than SCALLOP and 11 times more transcripts than STRINGTIE while also having higher precision in all simulated cases. Fig. C4 shows that all three methods produce similar precision and recall values for different sequencing replicates of the same simulated instance of (T , c), demonstrating consistency in results. Finally, Fig. 3e shows the number of canonical and non-canonical transcripts generated by the three methods for each simulated instance that match the ground truth. In summary, we found that JUMPER correctly predicts higher number of both canonical and non-canonical transcripts compared to SCALLOP and STRINGTIE for all the simulated instances. We observe similar trends on simulated instances of a human gene (see Appendix C.4). The results from all the simulations are summarized in Table C1 .

In a recent paper, Kim et al. [3] explored the transcriptomic architecture of SARS-CoV-2 by performing short-read as well as long-read sequencing of Vero cells infected by the virus. The authors used oligo(dT) amplification, which targets the poly(A) tail at the 3' end of messenger RNAs, thus limiting positional bias that would occur when using SARS-CoV-2 specific primers [25, 26] . Subsequently, the authors aligned the resulting reads using splice-aware aligners, i.e. STAR [15] for the short-read sample (median depth of 1763)

and minimap2 [27] for the long-read sample (median depth of 6707 and mean length of 2875 bp). For both complementary sequencing techniques, the authors observed phasing reads that were indicative of canonical as well as non-canonical transcription events. While the authors quantified the fraction of phasing reads supporting each discontinuous transcription event, they did not attempt to assemble complete transcripts.

We use JUMPER to reconstruct the SARS-CoV-2 transcriptome of the short-read sequencing sample using the BAM file obtained by running Kim et al.'s pipeline [3] , followed by running SALMON to identify precise transcript abundances. We note that running SCALLOP on the short-read data resulted in only a single, complete canonical transcript (corresponding to 'N') but required subsampling of the BAM file (to 20%) due to memory constraints, whereas STRINGTIE produced two incomplete transcripts ('ORF3a' and and a non-canonical transcript with low support). We build a segment graph with |V | = 59 nodes and (abundance of 0.008) corresponds to the complete viral genome, necessary for viral replication. Notably, ORF10 is the only missing ORF in the identified transcriptome, which is in line with previous studies [3, 28] that did not find evidence for active transcription of ORF10.

As mentioned, JUMPER inferred 9 non-canonical transcripts, denoted as X, X', 1ab', S', 3a', 6', E', 7b * and N'. Among these, transcripts 1ab', S', 3a' and 6' encode for the 1ab polypeptide, spike protein S and accessory protein 3a and accessory protein 6, respectively. Transcripts X and X' both contain the discontinuous edge going from position 68 to 15774, with the latter containing an additional discontinuous edge from position 26256 to 26284. The 5' end of the common discontinuous edge occurs within TRS-L, whereas the 3' end occurs in the middle of ORF1b but is out of frame with respect to the starting position of ORF1b (13468). Specifically, the start codon 'ATG' downstream of the 3' end is located at position 15812 and occurs within nsp12 (RdRp) and the first stop codon is located at position 15896, encoding for a peptide sequence of 28 amino acids. Interestingly, when we examine the reference genome, we observe matching sequences "GAACTTTAA" near the 5' and 3' junctions of the discontinuous edge common to X and X', possibly explaining why the viral RdRp generated this jump (Fig. C5a,b) . Strikingly, both matching sequences are conserved within the Sarbecovirus subgenus but not in other subgenera of the Betacoronavirus genus (Fig. C5a,c) . To further corroborate this transcript, we examined short and long-read SARS-CoV-2 sequencing samples from the NCBI Sequence Read Archive (SRA). Specifically, we looked for the presence of reads potentially originating from transcript X focusing on high-quality samples with 100 or more leaderspanning reads (reads whose 5' end intersects with the TRS-L region). A read r supports a transcript T if the discontinuous edges of r exactly match those of T , i.e. π(r) ⊆ π(T ) and |σ ⊕ (r)| = |σ(T )| (Fig. C6) .

We find ample support for transcript X in both short and long-read samples on SRA, with 100 out of 351

short-read samples and 81 out of 653 long-read samples having more than 0.001 leader-spanning reads supporting transcript X (Fig. C7) . We note that although this discontinuous transcription event was also observed in [28] , the authors found no evidence of this transcript leading to protein product in the ribo-seq data. Further research into a potentially regulatory function of this transcript is required.

As stated, the difference between transcripts X and X' is that the latter includes an additional discontin- One of the major findings of the Kim et al. paper [3] is that the SARS-CoV-2 transcriptome is highly complex owing to numerous non-canonical discontinuous transcription events. Strikingly, our results show that these non-canonical transcription events do not significantly change the resulting proteins. Indeed, we find that 4 out of the 9 non-canonical transcripts produce a complete known viral protein and the total abundance of the predicted transcripts that produce a complete known viral protein is 0.968. Moreover, these predicted transcripts account for more than 90% of the reads in the sample according to the estimates provided by SALMON.

Typically, reads from short-read sequencing samples are not long enough to contain more than one discontinuous edge. As a result, short-read data can only provide direct evidence for transcripts with closely spaced discontinuous edges. For instance, we observe ample support (63485 short reads) for the predicted non-canonical transcript E', which has two discontinuous edges (69, 26237) and (26256, 26284), in shortread data due to the close proximity of the two discontinuous edges (i.e. the discontinuous edges are only 26256 − 26237 = 19 nucleotides apart). The other non-canonical transcripts with multiple discontinuous edges, i.e. X', S', 3a', 6' and N', have edges that are too far apart to be spanned by a single short read.

Using the long-read sequencing data of this sample, we detect supporting long reads that span the exact set of discontinuous edges of all 9 non-canonical transcripts (Fig. 4c) . Moreover, we find support for the canonical transcripts as well (Fig. C8) . Thus, all transcripts identified by JUMPER from the short-read data are supported by direct evidence in the long-read data.

In summary, using JUMPER we reconstructed a detailed picture of the transcriptome of a short-read sequencing sample of Vero cells infected by SARS-CoV-2. While existing methods failed to recall even the reference transcriptome, JUMPER identified transcripts encoding for all known viral protein products. In addition, our method predicted non-canonical transcripts and their abundances, whose presence we subsequently validated on a long-read sequencing sample of the same cells.

To show the generalizability of our method, we consider another coronavirus, SARS-CoV-1. We analyzed two published samples of human Calu-3 cells infected with SARS-CoV-1 [14] , SRR1942956 and SRR1942957, with a median depth of 21,358 and 20,991, respectively. These two samples originate from the same SRA project ('PRJNA279442') whose metadata states that both samples were sequenced 24 hours after infection. We used fastp to trim the short reads (trimming parameter set to 10 nucleotides) and

we aligned the resulting reads using STAR in two-pass mode. We ran JUMPER with the 35 most abundant discontinuous edges in the segment graph. Similarly to the previous analysis, we restrict our attention to transcripts identified by JUMPER that have more than 0.001 abundance as estimated by SALMON. There are 13 out of 25 such transcripts for sample SRR1942956 and 13 out of 26 such transcripts for sample SRR1942957 (Fig. 5 ).

SARS-CoV-1 has a genome of length 29751 bp, and consists of 13 ORFs (1ab, S, 3a, 3b, E, M, 6, 7a, 7b, 8a, 8b, N and 9b), two more than SARS-CoV-2. For both samples, JUMPER identifies canonical transcripts corresponding to all the ORFs of SARS-CoV-1 except ORF3b, ORF8b and ORF9b (Fig. 5) .

Notably, ORF8b and ORF9b share transcription regulating body sequences (TRS-B) with ORF8a and ORF N respectively [32] . More specifically, ORF9b (from position 28130 to 28426) is nested within ORF N (from position 28120 to 29388) with start codons only 10 nucleotides apart and consequently shares the same TRS-B as ORF N. ORF8b (from position 27864 to 28118) intersects with ORF8a (from 27779 to 27898) and previous studies have failed to validate a TRS-B region for ORF8b [32] . One possible way that these ORFs are translated is due to ribosome leaky scanning, which was also hypothesized to lead to ORF7b translation in SARS-CoV-2 [28] . This explains why JUMPER was unable to identify transcripts that directly encode for 8b and 9b. As for ORF3b, JUMPER did identify a canonical transcript corresponding to 3b in both samples, but the SALMON estimated abundances for these transcripts were below the cut-off value (0.00044 for SRR1942956 and 0.0005 for SRR1942957). Finally, we note that the relative abundances of the canonical transcripts are consistent for the two samples (Fig. 5) , ranked in the same order (Fig. C9) , with

ORF7b being the least abundant and ORF N having the largest abundance, in line with the observations in SARS-CoV-2 (see Section 5.2). domain [30] . This is similar to transcript 7b * in the SARS-CoV-2 sample (Section 5.2) which yields a N-terminal truncated version of protein 7b. Detection of non-canonical transcripts such as E' and 7b * in SARS-CoV-2 and N' in SARS-CoV-1 suggests that generation of N-terminally truncated proteins might be a common feature in coronaviruses. In summary, JUMPER can be applied to all coronaviruses to reconstruct the transcriptome from short-read sequencing data and be used to discover novel transcripts and the viral protein products that they encode.

In this paper, we formulated the DISCONTINUOUS TRANSCRIPT ASSEMBLY (DTA) problem of reconstructing viral transcripts from short-read RNA-seq data for SARS-CoV-2 and, more generally, viruses in the order of Nidovirales. The discontinuous transcription process exhibited by the viral RNA-dependent RNA polymerase (RdRp) is distinct from alternative splicing observed in eukaryotes. Our proposed method, JUMPER, is specifically designed to reconstruct the viral transcripts generated by discontinuous transcription and is therefore able to outperform existing transcript assembly methods such as SCALLOP and STRINGTIE, as we have shown in both simulated and real data. For real-data analysis, we used publicly available shortread and long-read sequencing data of the same sample of virus infected Vero cells [3] . We performed transcript assembly using the short-read sequencing data and used the long-read data for validation. JUMPER was able to identify transcripts encoding for all known viral proteins except ORF10, which has been shown to have little support of active transcription in previous studies [3, 28] . Moreover, we predicted 9 noncanonical transcripts that are well supported by long-read sequencing data. For two samples of Calu-3 cells infected by SARS-CoV-1 [14] , JUMPER reconstructed all the canonical transcripts with distinct TRS-B regions and additionally predicted the presence of non-canonical transcripts encoding for either complete or truncated versions of known viral proteins.

There are several avenues for future work. First, the computational complexity of the DTA problem remains open. Second, we plan to extend our current model to account for positional and sequencing biases in the data. Doing so will enable us to assemble transcriptomes from sequencing samples that used SARS-CoV-2-specific primers, which form the majority of currently available data. Third, we currently make the assumption that the read fragments are much smaller compared to the viral transcripts. We will relax this assumption in order to support long-read sequencing data that have variable read length. Fourth, we plan to study the effect of mutations (including single-nucleotide variants as well as indels) on the transcriptome.

Along the same lines, there is evidence of within-host diversity in COVID-19 patients [33] [34] [35] [36] [37] [38] . It will be interesting to study whether this diversity translates to distinct sets of transcripts and abundances within the same host. Fifth, there are possibly multiple optimal solutions to the DTA problem that present equally likely viral transcripts with different relative abundances in the sample. A useful direction of future work is to explore the space of optimal solutions similar to the work done in [18] . Finally, the approach presented in this paper can extended to the general transcript assembly problem, where a topological ordering of the nodes in the splice graph will serve the same function as the unique Hamiltonian path of the segment graph did in the DTA problem. We envision this will facilitate efficient use of combinatorial optimization tools such as integer linear programming to transcript assembly problems.

Data availability. Simulated data is available at https://doi.org/10.13012/B2IDB-6667667_ V1. The code is available at https://github.com/elkebir-group/Jumper.

We use the segment graph G to compute the probability Pr(R | T , c) of observing the alignment R given transcripts T and abundances c. We follow the generative model described in [39] , which has been extensively used for transcription quantification [19] [20] [21] . Pr(r j | T , c)

where Z i,j is the indicator random variable for the event that T i is the transcript of origin for read r j . We denote by Pr(r j | Z i,j ) the probability of observing read r j given that it is generated from transcript T i and Pr(Z i,j | T , c) denotes the probability of generating a read from transcript T i given transcripts T and abundances c.

Assuming no amplification and sequencing bias, the probability Pr(Z i,j | T , c) of generating a read from a transcript T i of length L i is given by

We now derive the probability Pr(r j | Z i,j ) of transcript T i generating read r j of fixed length . We do so using the segment graph G = (V, E). Recall that a transcript T must correspond to an s to t path in G. Let π(T ) ⊆ E denote the path corresponding to transcript T . Similarly, each read r induces a path π(r) ⊆ E in G. Read r can only be generated by transcript T if π(r) ⊆ π(T ). Hence, the probability of transcript T i generating a given read r j is given by

where L i = L i − is the effective length of the transcript. We assume that the transcripts are much longer than the reads and as such L i /L i ≈ 1. Putting it all together we get

We prove the following two main text propositions.

(Main Text) Proposition 1. There is a bijection between subsets of discontinuous edges that are pairwise non-overlapping and s − t paths in G.

Proof. Let Π be the set of s − t paths in G. We indicate with Σ the family of subsets of discontinuous edges that are pairwise non-overlapping. Note that Σ ⊆ 2 E .

For an s − t path π ∈ Π, let f (π) be the set of discontinuous edges in π, i.e. f (π) = π ∩ E . Since π is an s − t path of G, we have that for each edge

Therefore, f (π) is composed of pairwise non-overlapping disconnected edges. Now, consider a subset σ ∈ Σ of discontinuous edges that are pairwise non-overlapping. We obtain the corresponding s − t path f −1 (σ) by first ordering the edges of σ in ascending order. That is, let

for all i ∈ {1, . . . , |σ|−1}. For every two consecutive discontinuous edges

, we include the corresponding subpath of continuous edges from w i to v i+1 into f −1 (σ). In addition, we include the subpath of continuous edges from node s to node v 1 as well as the subpath from node w |σ| to t into f −1 (σ). By construction, f −1 (σ) is an s − t path.

(Main Text) Proposition 2. Let G be a segment graph, T be a transcript and r be a read. Then, π(T ) ⊇ π(r) if and only if σ(T ) ⊇ σ ⊕ (r) and σ(T ) ∩ σ (r) = ∅.

Proof. (⇒) (⇒) By the premise, π(T ) ⊇ π(r). By definition, σ(T ) = π(T ) ∩ E . By Definition 4, σ ⊕ (r) = π(r) ∩ E . As π(T ) ⊇ π(r), we have that σ(T ) = π(T ) ∩ E ⊇ π(r) ∩ E = σ ⊕ (r).

By definition, σ (r) is the subset of discontinuous edges in E \ σ ⊕ (r) that overlaps with an edge in π(r). Since π(T ) ⊇ π(r), every edge included in σ (r) because of an overlap with an edge in π(r) must also overlap with the same edge in π(T ). Since π(T ) is an s − t path, and thus does not contain pairwise overlapping edges, we infer that σ (r) ∩ σ(T ) = ∅.

(⇐) (⇐) By the premise, σ(T ) ⊇ σ ⊕ (r) and σ(T ) ∩ σ (r) = ∅. As σ(T ) ⊇ σ ⊕ (r), we have that π(T ) ∩ E = σ(T ) ⊇ σ ⊕ (r) = π(r) ∩ E . Since σ(T ) ∩ σ (r) = ∅, we have by Definition 4, that no discontinuous edge in σ(T ) overlaps with any edge in π(r). Since π(T ) is an s−t path containing the subset σ ⊕ (r) of discontinuous edges in π(r), it holds that π(T ) ∩ E → ⊇ π(r) ∩ E → . Finally, as E ∪ E → = E, π(r) ⊆ E and π(T ) ⊆ E, we get π(T ) ⊇ π(r).

Using this proposition, we derive a simpler form of the likelihood given in Equation (2). Let S = {(σ ⊕ 1 , σ 1 ), . . . , (σ ⊕ m , σ m )} be the set of characteristic discontinuous edges generated by the reads in alignment R. Let d = {d 1 , · · · , d m } be the number of reads that map to each pair in S. Using that distinct reads r j and r j with the same characteristic discontinuous edges (σ ⊕ (r j ), σ (r j )) = (σ ⊕ (r j ), σ (r j )) have the same likelihood in terms of (2), we have

Now, taking the logarithm yields

The goal is the remove the second sum in the above equation, as it is convex and we are maximizing. In order to do so, we first prove the following lemma.

Lemma 2. For any given scaling factor α > 0, we have that log Pr(R | T , c) = log Pr(R | T , αc).

log

This enables us to prove the following lemma. 

Proof. We will refer to the optimization problem in (3)-(6) as P and the optimization problem (14)-(17) as Q. Further, we will refer to the objective function in (3) as J(T , c) and the objective function in (14) as

where the last equality uses (13) . We now show that if (T , c) is an optimal solution to problem P , then (T , c) is an optimal solution to problem Q. Let (T , c ) be an optimal solution to problem Q. Then, by optimality of (T , c ), we have

Let c = [c 1 (c ), . . . , c k (c )]. Note that c satisfies constraints (5) . Thus (T , c ) is a feasible solution to problem P . Since (T , c) is an optimal solution of P , we have

Since c and c only differ by a positive scaling factor α = 1/ k i=1 c i , we use Lemma 2 to get J(T , c ) = J(T , c ). Similar result holds for c and c, i.e. J(T , c) = J(T , c). Applying this to (20) , we get J(T , c) ≥ J(T , c ).

Using (16) and (18), we get

Finally, using (19) and ( Next, we need to show that (T , c) is an optimal solution to problem P . Let (T , c ) be an optimal solution to problem P .

Then, from the optimality condition, we get

Let c = [c 1 (c ), . . . , c k (c )]. Note that c satisties constraint (16) and thus (T , c ) is a feasible solution to problem Q. Using (18) and the fact that (T , c) is an optimal solution of problem P we get

Observe that c and c only differ by a positive scaling factor α = D/ k j=1 c j L j . Therefore, using 

Moreover, (22) and (24) 

In the following, we introduce variables and constraints to encode the following.

(i) The composition of each transcript T i as a set σ(T i ) of non-overlapping discontinuous edges.

(ii) The abundance c i and length L i of each transcript T i .

(iii) The total abundance i∈X(T ,σ ⊕ j ,σ j ) c i of transcripts supported by characteristic discontinuous edges (σ ⊕ j , σ j ).

(iv) A piecewise linear approximation of the log function.

We describe (iii) and (iv) in the following and refer to Section 4 for (i) and (ii).

Contribution of transcripts to each pair of characteristic discontinuous edges. The objective function has m terms, one corresponding to each pair (σ ⊕ j , σ j ) ∈ S of characteristic discontinuous edges (see Eq. (7)). Specifically, each term j equals d j log i∈X(T ,σ ⊕ j ,σ j ) c i where d j is a constant, for all j ∈ [m]. We introduce non-negative continuous variables q = {q 1 , . . . , q m } such that

where the last equality uses the characterization of candidate transcripts of origin for a given read described in Proposition 2. We introduce continuous variables y j ∈ [0, 1] k that encode the product y j,i =

x e,i e ∈σ j x e ,i . Intuitively, each variable y j,i encodes the contribution of a transcript T i for the given characteristic discontinuous edge sets (σ ⊕ j , σ j ). We linearize the product c i e∈σ ⊕ j x e,i e ∈σ j x e ,i as follows.

Hence, we have

Objective function. The objective function (Eq. (7)) can be written in terms of continuous variables q as

where d j is a constant and q is as in (26) . We use the lambda method to approximate our objective method using a piecewise linear function [40] . Following the method described in [40] , we partition the domain 

. Therefore the objective function we wish to maximize is

Note that since we have a log-likelihood objective function, feasibility of the solution requires that

. This means that for each characteristic discontinuous edge sets (σ ⊕ j , σ j ), there must be at least one candidate transcript of origin T i with non-zero abundance c i > 0. This leads to the solution containing a large number of transcripts and making the problem intractable while also preventing us from finding parsimonious sets of transcripts that support most but not all of the observed reads in the sample.

Finding such parsimonious solutions is often desirable since they provide a reasonable explanation of the observed reads while keeping the problem computationally tractable. In order to allow us to generate solutions that can partially explain the observed reads, we slightly modify our objective function. We introduce a new breakpoint b 0 = 0 and associated continuous variables

The objective function we maximize is

where δ > 0 is a small constant. Note that instead of evaluating the log function at b 0 , we include log(δ) which is well defined since δ > 0. In this study, we choose δ = b 1 /100 = 1/(2 h−1 × 100) while h is left as the user's choice with default value of 16.

Moreover, the choice of breakpoints to approximate the objective function (Eq. 7) can have a significant impact on the accuracy of the MILP solver. As a result, there has been research in efficient methods for choosing optimal breakpoint locations for convex functions, such as recursive descent algorithms [41] . In this work we take a simpler approach, by choosing breakpoints such that their spacing around a given breakpoint is proportional to the local gradient of the objective function. For the log function, this is equivalent to choosing breakpoints such that b i = 2 i−1 /2 h−1 . Note that b 0 = 1/2 h−1 while b h = 1.

Number of variables and constraints. The total number of binary variables x is |E |k. Note that q are auxiliary (intermediate) variables that are uniquely determined by c, y, z and λ. Therefore, the total number of required continuous variables (i.e. c, y, z and λ) is k + mk + |E |k + mh. The number of constraints is O(k|E| 2 + |E|km). We provide the full MILP for reference. 

Here we describe the subproblems that are solved at each iteration of the greedy heuristic. For a given set of transcripts T and characteristic discontinuous edge sets S, consider the optimization problem which we denote by P 1 ,

and the following optimization problem denoted by P 2 ,

Solution to P 1 We obtain the solution of P 1 by solving the optimization problem (7) In practice, we see spurious discontinuous edges in the resulting segment graph due to sequencing and alignment errors. We filter these edges by requiring a minimum number Λ of spliced reads to support each discontinuous edge in the segment graph. The higher the value of Λ, fewer will be the number of edges and nodes in the resulting segment graph.

It is not trivial to know the right value of Λ to remove all false positive discontinuous edges. Several heuristics are used in existing methods to remove spurious splicing events. SCALLOP removes an edge e from its splice graph if the coverage of the exons of either end of the edge is more than 2w(e) 2 + 18, where w(e) is the number of spliced reads that support the edge e. STRINGTIE on the other hand, terminates its algorithm of assembling transcripts when the coverage of all the paths in the splice graph build from the un-assigned reads drops below a threshold, set by default to 2.5 reads per base-pair. By default, JUMPER requires a support of 100 reads for a discontinuous edge to be included in the segment graph.

Another parameter that can be used to filter false-positive splicing events is the number of discontinuous edges allowed in the segment graph. From tests on simulated instances emulating SARS-CoV-2 samples, we found that focusing on the 35 most abundant discontinuous edges is sufficient to get a summary of the transcriptome and highly expressed canonical and non-canonical transcripts in the sample. A higher value can be used to capture more complexity of the transcriptome. By default, we set this parameter to 35.

Our simulations are based on a widely believed model of discontinuous transcription. Briefly, there are two competing models of discontinuous transcription for coronaviruses [42] . Both models agree that the RdRP jump is mediated by matching core-sequences (motifs) present in the TRSs in the viral genome. The only point of difference between the two models is whether discontinuous transcription occurs during the plusstrand synthesis or the minus-strand synthesis. The negative-sense discontinuous transcription model [43] proposes that the it is during the minus-strand synthesis that the RdRP performs discontinuous transcription.

Transcription is initiated at the 3' end of the plus-strand RNA and the RdRP jumps to the TRS-L region when it reaches a TRS-B region adjacent to a gene, thereby generating a minus-strand subgenomic RNA.

The minus-strand subgenomic RNA is then replicated by the RdRP to produce a plus-strand RNA which can be translated to a viral protein. Currently, this model is largely believed to be true due to the considerable experimental support from genetic studies detecting minus-strand subgenomic RNAs [44] [45] [46] [47] [48] .

We now describe the procedure to simulate transcripts and their abundances following the negative-sense model of discontinuous transcription for a given segment graph. otherwise. This is done to ensure that canonical transcripts are generated with high enough abundance, making the simulations similar to real data. The next step of our simulation pipeline is to generate transcripts T and their abundances c for the given segment graph. We simulate the transcription process by generating 100,000 s − t paths on the segment graph and report the number of unique paths/transcripts T and their abundances c. We repeat this process to generate 5 independent sets of transcripts and abundances for the positive and the negative model each. Fig. 3b shows the number of transcripts generated from each simulation using the negative-sense discontinuous transcription model. To contrast, the total number of s − t paths in the underlying segment graph is 3440.

Once the transcripts are generated, the next step in our pipeline is to simulate the generation and sequencing of RNA-seq data. We use polyester [24] for this step as it allows the user to provide the number of reads generated from each transcript. For a given total number n of reads, the number of reads generated from transcript T i is given by nc i L i / k j=1 c j L j where L i is the length of the transcript T i . We use the default parameters for read length ( = 100) and fragment length distribution (Gaussian with mean µ r = 250 and standard deviation σ r = 25) to generate 3,000,000 reads. For each set of transcript and abundances generated in the previous step of the pipeline, we simulate 5 replicates of the sequencing experiment.

The final step of the simulation pipeline is to align the generated reads to the reference genome NC 045512.2 using STAR [15] . The resulting BAM file serves as the input for the transcription assembly methods. To summarize, we generated 5 independent pairs (T , c) of transcripts and abundances under the negative-sense discontinuous transcription model. For each pair (T , c) we run 5 simulated sequencing experiments using polyester [24] . Therefore, we generated a total of 5 × 5 = 25 simulated instances.

We use the following arguments.

We run STRINGTIE in de novo transcript assembly mode. That is, we do not provide a GFF file to guide assembly. We use the following arguments.

We noted that STRINGTIE produces incomplete transcripts, i.e. all the assembled transcripts did not map to the 5' and 3' end of the reference genome. In our simulations, STRINGTIE was not penalized for this as our evaluation metrics considered only discontinuous edges.

We evaluate the performance of JUMPER, SCALLOP and STRINGTIE on simulated samples of a single human gene as well. We use a selected region (from position 5001 to 30255) of the FAS gene as the reference genome 1 with is located on the long arm of chromosome 10 in humans and encodes the Fas cell surface receptor which leads to programmed cell death if it binds its ligand (Fas ligand).

We include the following 3 distinct isoforms of this gene (P25445-1, P25445-6 and P25445-7) with equal proportions in the ground truth. 

We have the following supplementary figures.

• Fig. C2 shows that JUMPER outperforms SCALLOP and STRINGTIE for all simulation instances in terms of F 1 score, recall and precision while maintaining a modest running time.

• Fig. C3 shows that JUMPER outperforms SCALLOP and STRINGTIE for varying values of thresholding parameter Λ.

• Fig. C4 shows that JUMPER produces better recall and precision when compared to SCALLOP and STRINGTIE for every simulation instance (T , c).

• Fig. C5 shows that the core sequence observed in the reference genome potentially explaining a noncanonical discontinuous transcription event, and the core sequence corresponding to transcript X is conserved across Sarbecovirus species.

• Fig. C6 shows an example of a supporting read for a transcript with two discontinuous edges.

• Fig. C7 shows that transcript X is supported in both long-read and short-read samples deposited in SRA.

• Fig. C8 shows the number of supporting reads with the 5' end mapping to the leader sequence in the short and long read sequencing data.

• Fig. C9 shows the abundances of the predicted transcripts by JUMPER in two SARS-CoV-1 infected samples.

• Table C1 shows summary of the results from the simulations. (a) F 1 score (b) recall, (c) precision and (d) time taken by the JUMPER for different values of Λ compared to SCALLOP and STRINGTIE on the simulated instances. As expected, the recall value drops for increasing Λ while the precision increases. We set the default value of Λ to 100 which incurs runtime comparable to SCALLOP while producing higher recall and precision solutions. 1 is supported by r 2 because π(r 2 ) = π(T 1 ) and |σ ⊕ (r 1 )| = |σ ⊕ (T 1 )| = 2. Reads r 1 , r 3 and r 4 do not support T 1 since |σ ⊕ (r 1 )| < |σ ⊕ (T 1 )| and π(r 3 ), π(r 4 ) π(T 1 ). No reads support T 2 since |σ ⊕ (r j )| < |σ ⊕ (T 2 )| for all reads r j . Figure C7 : Transcript X has supporting reads in multiple independent SRA samples of SARS-CoV-2. Distribution of number of (a) short-read and (b) long-read SRA samples with varying proportion of leader-sequence spanning reads that support transcript X. All the short-read samples were aligned using STAR [15] while the long-read samples were aligned using minimap2 [27] . In this plot we only consider samples with more than 100 reads that map to the leader-sequence (position 55 to 85 in the SARS-CoV-2 reference genome). non-canonical transcripts across the two samples.

The genome organization of the Nidovirales: similarities and differences between Arteri-, Toro-, and Coronaviruses

Coronaviruses: methods and protocols

The architecture of SARS-CoV-2 transcriptome

De novo assembly and analysis of RNA-seq data

Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data

SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads

Bridger: a new framework for de novo transcriptome assembly using RNA-seq data

Oases: robust de novo RNAseq assembly across the dynamic range of expression levels

Accurate assembly of transcripts through phase-preserving graph decomposition

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

CLASS: constrained transcript assembly of RNA-seq reads

Competing endogenous RNA network profiling reveals novel host dependency factors required for MERS-CoV propagation. Emerging microbes & infections

STAR: ultrafast universal RNA-seq aligner

Efficient RNA isoform identification and quantification from RNA-Seq data with network flows

A convex formulation for joint RNA isoform detection and quantification from multiple RNA-seq samples

Finding ranges of optimal transcript expression quantification in cases of non-identifiability

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Salmon provides fast and bias-aware quantification of transcript expression

Near-optimal probabilistic RNA-seq quantification

Gurobi optimizer reference manual

The sequence alignment/map format and SAMtools

Polyester: simulating RNA-seq datasets with differential transcript expression

A rapid, cost-effective tailed amplicon method for sequencing SARS-CoV-2

nCoV-2019 sequencing protocol v3 (LoCost). protocols.io

Minimap2: pairwise alignment for nucleotide sequences

The coding capacity of sars-cov-2

Structure and drug binding of the SARS-CoV-2 envelope protein transmembrane domain in lipid bilayers

Crystal structure of sars-cov-2 nucleocapsid protein rna binding domain reveals potential unique drug targeting sites

Architecture and self-assembly of the sars-cov-2 nucleocapsid protein

Characterizing Transcriptional Regulatory Sequences in Coronaviruses and Their Role in Recombination

Characterization of SARS-CoV-2 viral diversity within and across hosts. bioRxiv

Intra-host site-specific polymorphisms of SARS-CoV-2 is consistent across multiple samples and methodologies. medRxiv

Characterization of intra-host SARS-CoV-2 variants improves phylogenomic reconstruction and may reveal functionally convergent mutations. bioRxiv

Genomic diversity of SARS-CoV-2 in coronavirus disease 2019 patients

SARS-CoV-2 exhibits intra-host genomic plasticity and low-frequency polymorphic quasispecies. bioRxiv

On the origin and continuing evolution of SARS-CoV-2

Exact transcript quantification over splice graphs

Polyhedral methods for piecewise-linear functions I: the lambda method

A recursive descent algorithm for finding the optimal minimax piecewise linear approximation of convex functions

Nidovirus transcription: how to make sense

Coronaviruses use discontinuous extension for synthesis of subgenome-length negative strands

Arterivirus discontinuous mRNA transcription is guided by base pairing between sense and antisense transcription-regulating sequences

Sequence motifs involved in the regulation of discontinuous coronavirus subgenomic RNA synthesis

Sequence requirements for RNA strand transfer during nidovirus discontinuous subgenomic RNA synthesis

The RNA structures engaged in replication and transcription of the A59 strain of mouse hepatitis virus

Recombinant equine arteritis virus as an expression vector

We add a poly-A tail of length 85 at the end of the reference genome as well as each of the isoforms to emulate the transcription process

sequencing of 30,000,000 paired-end reads of the sample with a Gaussian fragment length distribution with mean 250 and standard deviation of 25. We simulate 5 replicates of the sequencing experiment. The simulated reads are aligned to the selected region of the FAS gene using STAR [15]. The resulting BAM file serves as the input for the transcription assembly methods

Acknowledgments. The authors were supported by the National Science Foundation under award number CCF-2027669.

 Simulation   JUMPER  SCALLOP  STRINGTIE   seed rep can non-can  TP  FP  TP  FP  TP  FP  can non-can  can non-can  can non-can   0  1  14  94  7  9  1  7  4  8  2  0  14   0  2  14  94  8  11  0  7  2  8  1  0  13   0  3  14  94  7  11  2  4  1  8  1  0