key: cord-0976848-jd6ajrrt authors: Light, Dean; Haas, Roni; Yazbak, Mahmoud; Elfand, Tal; Blau, Tal; Lamm, Ayelet T. title: RESIC: A tool for comprehensive adenosine to inosine RNA Editing Site Identification and Classification date: 2021-04-12 journal: bioRxiv DOI: 10.1101/2021.04.11.439401 sha: d3d044d145af8a2810219444674655d99a5d407c doc_id: 976848 cord_uid: jd6ajrrt Adenosine to inosine (A-to-I) RNA editing, the most prevalent type of RNA editing in metazoans, is carried out by adenosine deaminases (ADARs) in double-stranded RNA regions. Several computational approaches have been recently developed to identify A-to-I RNA editing sites from sequencing data, each addressing a particular issue. Here we present RESIC, an efficient pipeline that combines several approaches for the detection and classification of RNA editing sites. The pipeline can be used for all organisms and can use any number of RNA-sequencing datasets as input. RESIC provides 1. The detection of editing sites in both repetitive and non-repetitive genomic regions; 2. The identification of hyper-edited regions; 3. Optional exclusion of polymorphism sites to increase reliability, based on DNA, and ADAR-mutant RNA sequencing datasets, or SNP databases. We demonstrate the utility of RESIC by applying it to human, successfully overlapping and extending the list of known putative editing sites. We further tested changes in the patterns of A-to-I RNA editing, and RNA abundance of ADAR enzymes, following SARS-CoV-2 infection in human cell lines. Our results suggest that upon SARS-CoV-2 infection, compared to mock, the number of hyper editing sites is increased, and in agreement, the activity of ADAR1, which catalyzes hyper-editing, is enhanced. These results imply the involvement of A-to-I RNA editing in conceiving the unpredicted phenotype of COVID-19 disease. RESIC code is open-source and is easily extendable. The conversion of adenosine to inosine (A-to-I) in double-stranded RNA regions, by adenosine 30 deaminases (ADARs) enzymes, is the most common form of RNA editing in metazoans (Bazak et to cause alignment errors (Treangen and Salzberg, 2011) . 53 Several tools developed to detect A-to-I RNA editing sites are based on comparison between RNA-54 seq reads and DNA reference genome. Among these tools are REDItools, which suggest simple 55 comparison using samtools (Picardi and Pesole, 2013) , and GIREMI that focused on distinguishing 56 between SNPs and editing, relying on existing SNP databases and a given RNA-seq data (Zhang 57 and Xiao, 2015). Some tools support a direct comparison between RNA-seq reads and DNA reads Hyper-editing by ADAR enzymes, which is defined as multiple A-to-I editing sites in a proximity, 67 is a widespread phenomenon. Since most tools designed to identify editing sites are based on the 68 detection of a limited number of mismatches in read alignments (to reduce alignment errors and to Gs in the reference genome and in the RNA-seq reads that previously failed to align, and 75 realignment is then carried out. Following reversion to original sequences, hyper-editing sites, 76 which are rich with A-to-G mismatches, can be located. In both studies, conversion to a three-base 77 code was repeated for all possible nucleotide pairs. It was shown that A-to-G editing was enriched 78 over the other editing types. Despite the efforts to develop computational tools for A-to-I RNA editing site detection from 80 sequencing data, to date there is not a single platform enabling robust detection of editing sites of 81 different classes and their classification. Here we present RESIC, which enables detection and 82 classification of A-to-I RNA editing sites of different types in a single tool. We expanded the pipeline we previously applied to identify editing sites in repetitive and non-repetitive regions 84 (Goldstein et al., 2017) and adopted the method by Wu et al. 2011 (Wu et al., 2011a) phenomena. The latter group is used to exclude changes deriving from SNPs. Since nucleotide 107 changes in the former sequencing datasets correspond to positive evidence of that sites undergoing 108 editing and the latter datasets correspond to negative evidence, we term these sets of datasets as 109 positive and negative datasets. RESIC is completely reference agnostic; the user provides 110 whichever reference file they wish to use for the alignment as well. Ambiguous read filtering 112 For ambiguous read filtering, we adopted the method of (Porath et al., 2014) . Briefly, we filtered 113 out the reads that meet the next criteria: one or more nucleotides represent over 60% or under 10% 114 of the read sequence, more than 10% Ns (when a base call could not be done), average Phred 115 quality score < 25, and more than 20 repeats of a single nucleotide in a row. Alignment scheme 117 We define an alignment scheme to be a 4-tuple = ( , , 1 , 2 ) where is an alignment 118 algorithm, is a list of alignment parameters for , 1 is a preprocessing function of the raw 119 datasets and 2 is a postprocessing function for aligned and misaligned reads. These seemingly 120 verbose definitions enable RESIC to decouple the choice of alignment algorithm from the rest of 121 the modules in RESIC. Let be an alignment scheme, be a sequencing dataset and be a genome reference, we define 123 , , and , , to be the aligned and misaligned read fractions resulting from running S on L and 124 R. We define S( , ) = ( , , , , , ) . 126 We say that a scheme is normal if 1 and 2 are identity functions in said scheme. Pseudo code for antisense reads as was not described elsewhere, to the best of our knowledge. First, for each and 143 nucleotides pairs, we first apply the scheme to the reads at the given node (see graph aligner) 144 and to the reference sense strand. Next, in order to identify hyper editing sites on the antisense 145 strand, for each and nucleotides pairs, we create the complement reference genome, based on 146 the original reference, and reverse the reads that were unmapped in the previous step, to achieve 147 the 3' -5' direction, same as the created reference. Then we reapply the 3nt alignment scheme while 148 considering the manipulation when recording the mapped reads as aligned to the antisense. In each step, after mapping the reads, aligned and unaligned reads are reverted to their original 150 sequence via custom python scripts. Figure S1 illustrates into details the 3nt genome alignment 151 scheme. To conduct the 3nt scheme we use awk (Aho et al., 1996) and sed (Free Software First, sites with no nucleotide changes and sites covered by less than reads are discarded. We 158 discarded sites from the positive datasets if those same sites appeared in any negative dataset with 159 a nucleotide change. Editing percent filtering 162 For each positive sequencing dataset, we filter out any site: (1) whose most abundant nucleotide 163 change constitutes less than 1 percent or more than 2 percent of the reads mapped to that site (2) 164 whose other nucleotide changes constitute over percent of the reads mapped to that site (3) whose 165 most abundant nucleotide change is in at least r reads. We term: 1 , the minimal editing percent 166 threshold, 2 , the maximal editing percent threshold, r, the editing read threshold, and u, the editing 167 noise threshold. Unique site filtering 169 We filter all sites that were defined as editing sites at the previous step, under more than one editing 170 category (e.g. non-repetitive and Hyper A to C), if they represented more than one type of 171 nucleotide change (e.g. once A to G and the other time A to C). Hyper editing filtering 173 Deriving from our method, it may be possible that under the hyper editing categories, a non-hyper 174 editing site would be recorded. Namely, for each pair of nucleotides X and Y that we perform the 175 3nt genome scheme, other nucleotide mismatches than hyper X to Y or Y to X may be noted, 176 enabled by the new conditions created by the 3nt scheme. Therefore, we filter the hyper editing 177 files to include only X to Y or Y to X changes (see an illustration in Figure S1 ). We filter out any sites that are not present in over c percent of positive datasets. A-to-I editing identification pipe 181 We implemented a hyper editing alignment scheme and built an alignment graph that could target 182 any editing type. Specifically, in the analysis described here we only applied RESIC to A-to-I RNA 183 editing. A-to-I RNA editing alignment graph 185 In our screen for A-to-I editing sites, we define two classes of alignment schemes, non-repetitive 186 alignment for reads that map uniquely to the genome and repetitive for repetitive regions or regions 187 that cannot be differentiated by our reads via alignment. Our graph alignment, summarized in 188 Figure S2 , is as follows: We align sequencing datasets using the non-repetitive normal scheme 189 followed by the repetitive normal scheme. Then we branch out and for each pair of distinct 190 nucleotides X and Y, we perform the non-repetitive 3nt genome scheme, and the repetitive 3nt was stranded (the sequenced strand must be from the actual expressed strand), we filtered the files 230 obtained to include only A-to-G, and not T-to-C sites. To perform differential expression analysis, we mapped the same unaligned reads that were used 232 for RESIC analysis before, to the human transcriptome version GRCh37 (hg19) using bowtie. Gene 233 expression levels were evaluated by read counts. We then compared our created gene counts to the 234 already processed counts downloaded from GEO: GSE147507. Although read count values were 235 not identical, as expected due to the use of different alignment tools, the trend was the same. phenomena, or a SNP database, to contradict editing-site existence. All datasets are first processed according to the graph alignment (Figure 1 A-B) . The graph 251 alignment was designed to track editing sites of different types by aligning a given set of reads to 252 the reference genome in a specific parameter configuration setup that represents each editing 253 class. Namely, we demanded unique or multiple alignment to detect nonrepetitive and repetitive 254 sites respectively, Next for sequences that did not align, we converted read-sequences to a three- In order to test the utility of the tool, we used RNA-seq datasets from 7 human tissues: adipose, and filtering criteria being set to consider sites as "editing sites" across tools. The distribution of the editing events divided into classes can be shown in (Figure S3 -Figure S8) . A-to-G and T-330 to-C are both considered as editing changes because the data is not stranded. Over all classes 331 being identified according to the graph alignment, A-to-G and T-to-C types were highly 332 enriched, as expected (Figure 2) . While non A-to-G mismatches are expected to be uncommon Table S2 ). We next annotated the unique sites for each sample type, across 374 different classes. It was apparent that the answer for our initial question is that among the unique Table S2) . 378 Interestingly, the gene APOBEC3C became hyper edited following SARS-CoV-2 infection 379 (Supplementary Table S3) , while in mock samples it was classified under the non-repetitive class. The APOBEC family of enzymes edits C-to-U RNA modifications and known to be involved in (Tables S5-S6) . 404 Given these observations, we reasoned that interrogating the differences in ADAR RNA expression 405 levels between SARS-CoV-2 and mock treated samples, would help to complete the picture. Collectively, we suggest that upon SARS-CoV-2 infection, compared to mock 1) the number of 425 hyper editing sites is increased; 2) ADAR1 activity is enhanced. The combination between these 426 two observations goes together with the finding that ADAR1 is the enzyme mostly catalyzing 427 hyper editing sites (Porath et al., 2014) . 428 We tested if these results hold true for more in-vitro SARS-CoV-2 infected cell types created in 429 the same study. For that purpose, we downloaded from GEO the already processed gene count 430 data, for A549 and NHBE cells infected with SARS-CoV-2 high-multiplicity of infection (MOI). We chose downloading the already processed gene count data after validating for Calu3 cells that Table S8 ). The non-significant upregulation of ADAR1 in the last case may be 438 because of the different cell types used. In any event, we concluded that unlike ADAR2 the 439 expression of ADAR1 is substantially different upon SARS-CoV-2, at least in some cell types. Figure S2 : RESIC graph alignment. Non-repetitive alignment is followed by repetitive alignment. This is followed by 3nt Genome scheme with non-repetitive alignment parameters and 3nt Genome scheme with repetitive alignment parameters for each pair of nucleotides. Awk ---A Pattern Scanning and 460 Processing Language Widespread A-to-I RNA editing of 462 Alu-containing mRNAs in the human transcriptome Critical Role of 465 MDA5 in the Interferon Response Induced by Human Metapneumovirus 466 Infection in Dendritic Cells and In Vivo Evidence for large diversity in the human transcriptome A-to-I RNA editing occurs at over a hundred million genomic sites, located 474 in a majority of human genes SARS-CoV-2 launches a unique transcriptional signature 478 from in vitro, ex vivo, and in vivo systems Adenosine-to-inosine RNA editing shapes transcriptome 561 diversity in primates Profiling RNA editing in human tissues: towards the inosinome REDItools: high-throughput RNA editing 567 detection made easy JACUSA: site-specific identification of RNA editing events from replicate 571 sequencing data Computational approaches for detection and 574 quantification of A-to-I RNA-editing A genome-wide map of hyper-577 edited RNA reveals numerous new sites ADAR RNA Modifications, the Epitranscriptome and Innate Immunity RADAR: a rigorously annotated database of 583 A-to-I RNA editing Identifying RNA editing sites using RNA sequencing data alone Transcriptome-wide sequencing reveals numerous APOBEC1 590 mRNA-editing targets in transcript 3' UTRs Understanding RNA 593 modifications: the promises and technological bottlenecks of the 594 'epitranscriptome' Interferon-stimulated 596 genes: a complex web of host defenses Repetitive DNA and next-generation 599 sequencing: computational challenges and solutions Double-stranded RNAs containing multiple 602 IU pairs are sufficient to suppress interferon induction and apoptosis RES-Scanner: 605 a software package for genome-wide identification of RNA-editing sites Competition between ADAR and 608 RNAi pathways for an extensive class of RNA targets FAS mRNA editing in Human Systemic Lupus Erythematosus Genome sequence-independent identification of 618 RNA editing sites Prediction of 620 constitutive A-to-I editing sites from human transcriptomes in the absence of 621 genomic sequences GeneMANIA prediction server 2013 update Detailed illustration of 3nt genome alignment scheme. (A) A reference genome The DNA sequence of the reference genome split by strand. Red, sense strand, presented in 5'-to-631 3' direction; Blue, antisense strand, presented in 3'-to-5' direction. The nucleotide sequences 632 complement each other. (C) G-A 3nt replacement for the reference genome Brown nucleotides 635 present G-to-A changes compared to the reference genome. Purple nucleotides present other 636 mismatches than A-to-G or G-to-A. Red, sense strand; Blue, antisense strand. (E) Alignment 637 following 3nt graph scheme -Part 1 Alignment 639 following 3nt graph scheme-Part 2, 3nt antisense: unmapped reads from part 1 are being reverted 640 (not shown) and reversed to fit the 3'-to-5' direction of the antisense reference strand. Gs are 641 converted into As in the cDNA reads. Hyper edited antisense reads are successfully being mapped 642 to the reference genome. (G) Reverting unmapped reads back from As to Gs and defining editing 643 sites Supplementary Table S2: characterization of editing landscape with respect to site locations and gene 671 annotation 672 Supplementary Table S3: unique genes under the SARS-CoV-2 nonrepetitive hyper-editing class and GO 673 enrichment analysis Supplementary Table S4: unique genes under the SARS-CoV-2 nonrepetitive class and GO enrichment 675 analysis Supplementary Table S5: unique genes under the mock nonrepetitive hyper-editing class and GO 677 enrichment analysis Supplementary Table S6: unique genes under the mock nonrepetitive class and GO enrichment analysis Supplementary Table S7: differential expression analysis for SARS-CoV-2 infected samples versus 680 mock, in the Calu3 cell line Supplementary Table S8: differential expression results of ADAR1 and ADAR2 genes CoV-2 infected samples versus mock