key: cord-0267023-kjr7zgm0 authors: Zhou, Xia; Huang, Xiaolan; Du, Zhihua title: A Computational and Biochemical Study of -1 Ribosomal Frameshifting in Human mRNAs date: 2021-04-24 journal: bioRxiv DOI: 10.1101/2021.04.23.441185 sha: 1905b33562cd284f355d84d95b8533a9719c52df doc_id: 267023 cord_uid: kjr7zgm0

−1 programmed ribosomal frameshifting (−1 PRF) is a translational recoding mechanism used by many viral and cellular mRNAs. −1 PRF occurs at a heptanucleotide slippery sequence and is stimulated by a downstream RNA structure, most often in the form of a pseudoknot. The utilization of −1 PRF to produce proteins encoded by the −1 reading frame is widespread in RNA viruses, but relatively rare in cellular mRNAs. In humans, only three such cases of −1 PRF events have been reported, all involving retroviral-like genes and protein products. To evaluate the extent of −1 PRF utilization in the human transcriptome, we have developed a computational scheme for identifying putative pseudoknot-dependent −1 PRF events and applied the method to a collection of 43,191 human mRNAs in the NCBI RefSeq database. In addition to the three reported cases, our study identified more than two dozen putative −1 PRF cases. The genes involved in these cases are genuine cellular genes without a viral origin. Moreover, in more than half of these cases, the frameshift site is located far upstream (>250 nt) of the stop codon of the 0 reading frame, which is nonviral-like. Using dual luciferase assays in HEK293T cells, we confirmed that the −1 PRF signals in the mRNAs of CDK5R2 and SEMA6C are functional in inducing efficient frameshifting. Our findings have significant implications for expanding the repertoire of the −1 PRF phenomenon and the protein-coding capacity of the human transcriptome.

frame, X can be any nucleotide, Y can be A or U, Z can be A, C or U).
During −1 PRF, the P site and A site tRNAs on the XXY and YYZ codons shift back one nucleotide to the XXX and YYY codons in the −1 reading frame. Efficient frameshifting is often stimulated by an RNA structure located several nucleotides downstream of the slippery sequence. In most known cases of −1 PRF, the frameshift stimulator RNA structure is an H (hairpin) type pseudoknot (3-7). An H-type pseudoknot is formed when a stretch of nucleotides within the loop region of a hairpin basepairs with a complementary sequence outside that loop (8-10). It is well known that many RNA viruses, including the etiological agents for AIDS (HIV-1: human immunodeficiency virus type-1) and COVID-19 (SARS-CoV2: severe acute respiratory syndrome coronavirus 2), utilize −1 PRF to express certain key viral proteins at a defined ratio (1, 6, 11, 12). The levels of −1 PRF are maintained within a relatively narrow range; increasing or decreasing the −1 PRF efficiency outside of that range significantly attenuates the production of infectious virions (13-16). Therefore, perturbation of −1 PRF efficiency may represent a viable strategy for the development of antiviral therapeutics. Efforts have been made to exploit the viral −1 PRF signals, especially the frameshift-stimulatory structures, as putative drug targets (17-25). Small organic compounds, peptides, antisense oligonucleotides and peptide nucleic acids (PNAs) have been identified that can modulate −1 PRF induced by the frameshifting signals in HIV-1, SARS-CoV, and SARS-CoV2 (17, 18, 26-35). In contrast to the widespread utilization of −1 PRF in RNA viruses, only a few cases of −1 PRF have been reported in cellular mRNAs.
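The slippery-sequence rule described above (a heptanucleotide X XXY YYZ whose XXY and YYZ codons lie in the 0 reading frame, with X any nucleotide, Y being A or U, and Z being A, C or U) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the search procedure used in this study: the function names `find_slippery_sites` and `shifted_codons`, the 0-based `cds_start` argument, and the regular expression are all hypothetical.

```python
import re

# Heptamer X XXY YYZ written without spaces: XXX YYY Z.
# X: any nucleotide (three identical), Y: A or U (three identical), Z: A, C or U.
# A lookahead is used so that overlapping candidates are all reported.
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([AU])\3\3[ACU]))")


def find_slippery_sites(mrna, cds_start=0):
    """Return (position, heptamer) pairs whose XXY and YYZ codons are in frame 0.

    `cds_start` is the assumed 0-based index of the first nucleotide of the
    0-frame coding sequence. The heptamer's first X is the last nucleotide of
    the preceding codon, so the XXY codon starts one nucleotide later.
    """
    mrna = mrna.upper().replace("T", "U")
    sites = []
    for match in SLIPPERY.finditer(mrna):
        start = match.start()
        if (start + 1 - cds_start) % 3 == 0:
            sites.append((start, match.group(1)))
    return sites


def shifted_codons(heptamer):
    """Return the P/A-site codons before and after a -1 shift.

    Frame 0 reads XXY and YYZ; after slipping back one nucleotide the same
    tRNAs sit on XXX and YYY.
    """
    zero_frame = (heptamer[1:4], heptamer[4:7])
    minus_one_frame = (heptamer[0:3], heptamer[3:6])
    return zero_frame, minus_one_frame


if __name__ == "__main__":
    toy = "AUGGCUUUAAACGGG"  # made-up sequence, coding frame starts at index 0
    for pos, hepta in find_slippery_sites(toy):
        print(pos, hepta, shifted_codons(hepta))
```

The frame check is what separates a candidate frameshift site from an incidental occurrence of the same heptamer in another frame; a real pipeline would additionally require a downstream stimulatory structure, as described in the text.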
So far, a retrovirus-like −1 PRF mechanism has been reported in the expression of three mammalian genes: the human paternally expressed gene 10 (PEG10) (36) and the paraneoplastic antigen Ma3 and Ma5 genes (37). All of these genes are derived from retroelements and encode viral-like proteins. In these cases, pseudoknot-dependent −1 PRF is used to express the overlapping −1 reading frame sequences in the mRNAs, with 15-30% efficiencies. Another reported case of pseudoknot-dependent −1 PRF in human mRNAs is in the C-C chemokine receptor type 5 (CCR5) mRNA (38). −1 PRF in CCR5 mRNA leads the translating ribosome to a premature termination codon in the −1 reading frame, which triggers the nonsense-mediated mRNA decay pathway to degrade the mRNA. In a computational study searching for potential −1 PRF signals in eukaryotes, it was found that up to 10% of mRNAs might utilize the −1 PRF mechanism (39). In more than 99% of the putative −1 PRF cases, a termination codon in the −1 reading frame is present shortly downstream of the frameshift site. Therefore, −1 PRF in these mRNAs would lead to termination of translation. It was proposed that these eukaryotic −1 PRF signals function as mRNA destabilizing elements through the nonsense-mediated mRNA decay (NMD) mechanism (38, 40). We have used bioinformatics approaches to perform a large-scale search for possible cases of pseudoknot-dependent −1 PRF events in human mRNAs, covering 43,191 human mRNA sequences in the NCBI RefSeq database. Several dozen putative cases were identified. Using dual luciferase reporter assays in HEK293 cells, we confirmed that two of the putative cases harbor functional −1 PRF events. Moreover, −1 PRF in these two mRNAs differs from that in the PEG10, Ma3, Ma5, and CCR5 mRNAs in that the frameshift site is located far from the C-terminus of the normal protein and a substantially long peptide is encoded by the −1 reading frame after frameshifting.
Abbreviations: S1&2, stem1&2; L1-3, loop1-3. A: linear sequential arrangement of the pseudoknot-forming sequence elements. Residues involved in the formation of S1 and S2 are represented as black and gray squares, respectively. Residues in the single-stranded loop regions are represented as unfilled circles. B: Schematic representations of a folded pseudoknot. Left: with a non-zero L3 sequence; right: in the absence of L3, S1 and S2 can stack coaxially to form a quasi-continuous double helix. L1 and L2 are located on the same side of the double helix, with L1 crossing the major groove of S2 and L2 crossing the minor groove of S1.

The data for this study consist of 43,191 human mRNA sequences obtained from the NCBI RefSeq database (www.ncbi.nlm.nih.gov/RefSeq/). Many algorithms and approaches have been developed to predict the formation of RNA pseudoknots. The general problem of predicting RNA pseudoknots has been proved to be NP-complete (non-deterministic polynomial-time complete). In general, most practical methods of predicting RNA pseudoknots have long running times and low accuracy for longer sequences. The previously reported algorithms and programs are therefore not suitable for handling the large number of long mRNA sequences in this study. To facilitate our study, we have developed a new heuristic program called PKscan for the identification of RNA pseudoknots. PKscan uses a dynamic sliding window that scans through the RNA sequence; therefore, there is no limitation on the length of the RNA sequence for analysis. Within each sliding window, iterating cycles of positional pairwise base matching are performed to detect complementary basepairing between two stretches of nucleotides. All possible combinations of stem and loop lengths within pre-defined ranges are interrogated for potential pseudoknot formation. PKscan can efficiently identify all potential RNA pseudoknots in an RNA sequence of any length.
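The scanning idea described above (a window moved along the sequence, with every combination of stem and loop lengths in pre-defined ranges tested by positional base matching) can be sketched in Python as follows. This is a minimal brute-force illustration and not the PKscan implementation: the function names, the returned dictionary layout, and the assumed linear order of elements (S1 5' strand, L1, S2 5' strand, L3, S1 3' strand, L2, S2 3' strand, inferred from the figure description in which L3 separates the two stems) are all assumptions. Allowed basepairs are taken to be the Watson-Crick pairs plus the GU wobble pair.

```python
from itertools import product

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def strands_pair(seq, five_start, three_end, length):
    """Check antiparallel pairing of seq[five_start + k] with seq[three_end - k]."""
    return all(
        (seq[five_start + k], seq[three_end - k]) in PAIRS for k in range(length)
    )


def scan_pseudoknots(seq, s1_lens, s2_lens, l1_lens, l2_lens, l3_lens):
    """Enumerate candidate H-type pseudoknots by exhaustive length combinations.

    Assumed linear layout (hypothetical, see lead-in):
        S1a - L1 - S2a - L3 - S1b - L2 - S2b
    Each *_lens argument is an iterable of non-negative lengths to try.
    """
    seq = seq.upper().replace("T", "U")
    hits = []
    for start in range(len(seq)):
        for s1, s2, l1, l2, l3 in product(s1_lens, s2_lens, l1_lens, l2_lens, l3_lens):
            s1a = start               # first base of the S1 5' strand
            s2a = s1a + s1 + l1       # first base of the S2 5' strand
            s1b = s2a + s2 + l3       # first base of the S1 3' strand
            s2b = s1b + s1 + l2       # first base of the S2 3' strand
            end = s2b + s2            # one past the last base of the window
            if end > len(seq):
                continue
            stem1_ok = strands_pair(seq, s1a, s1b + s1 - 1, s1)
            stem2_ok = strands_pair(seq, s2a, end - 1, s2)
            if stem1_ok and stem2_ok:
                hits.append({"start": start, "end": end, "S1": s1, "S2": s2,
                             "L1": l1, "L2": l2, "L3": l3})
    return hits


if __name__ == "__main__":
    toy = "GGCACCGGCCAAACGG"  # made-up sequence built to contain one pseudoknot
    for hit in scan_pseudoknots(toy, [3], [3], [1, 2], [3, 4], [0, 1]):
        print(hit)
```

Because the window length is simply the sum of the stem and loop lengths being tested, the cost per start position depends only on the size of the length ranges and not on the length of the mRNA, which is why a scan of this kind scales to long sequences. A practical tool would also need to score or filter the candidates, which this sketch does not attempt.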
Several relevant terms are defined mathematically as follows. An RNA sequence is defined as a string S = S1 S2 … Sn, where each Si ∈ {A, U, G, C}. Legitimate basepairs in RNA structures are the Watson-Crick basepairs and the GU wobble basepair. A legitimate basepair is represented as (Si, Sj) (i and j are positive integers and i ≠ j), which must be one of the following basepairs: (A,U), (U,A), (G,C), (C,G), (G,U), or (U,G). An RNA stem-loop structure is defined by the existence of two stretches of nucleotides (separated by an intervening sequence with a certain number of nucleotides) forming complementary basepairs. An RNA stem-loop structure is mathematically represented as (S{i}, S{j}), where {i} and {j} are arrays of consecutive positive integers, and jmin − imax ≥ t (jmin is the smallest number of the {j} array, and imax is the largest number of the {i} array). The value t is a positive integer. To form a stem-loop structure, the RNA sequence folds back on itself, but the loop region must have more than t nucleotides to be physically possible. In the context of our pseudoknot-detecting algorithms, t is a dynamic number depending on the stem and loop lengths of a potential pseudoknot in a particular iterating cycle of pseudoknot detection. Pseudoknot here refers to the H-type RNA pseudoknot. Mathematically, a pseudoknot is defined by the simultaneous existence of two interlocking stem-loop structures in a given RNA sequence (Figure 1). The two stem (double-stranded helical) regions of the stem-loop structures are represented as (S{i}, S{j}) and (S{i'}, S{j'}), where imax