key: cord-0919735-jb447ndz authors: Delli Ponti, Riccardo; Armaos, Alexandros; Marti, Stefanie; Tartaglia, Gian Gaetano title: A method for RNA structure prediction shows evidence for structure in lncRNAs date: 2018-04-16 journal: bioRxiv DOI: 10.1101/284869 sha: df6013a1630aa41627cc84c8e780d40ff93785cc doc_id: 919735 cord_uid: jb447ndz

To compare the secondary structures of RNA molecules we developed the CROSSalign method. CROSSalign combines the Computational Recognition Of Secondary Structure (CROSS) algorithm, which predicts RNA secondary structure at single-nucleotide resolution from sequence information, with the Dynamic Time Warping (DTW) method, which aligns profiles of different lengths. We applied CROSSalign to investigate the structural conservation of long non-coding RNAs such as XIST and HOTAIR as well as ssRNA viruses including HIV. In a pool of sequences with the same secondary structure, CROSSalign accurately recognizes repeat A of XIST and domain D2 of HOTAIR and outperforms other methods based on covariance modelling. CROSSalign can be applied to perform pair-wise comparisons and is able to find homologues among thousands of matches, identifying the exact regions of similarity between profiles of different lengths. The algorithm is freely available at the webpage http://service.tartaglialab.com//new_submission/CROSSalign.

managing profiles of different lengths without having to sacrifice computational time. We applied CROSSalign to lncRNAs of different species and to ssRNA viruses, and compared it with covariation models. CROSSalign is able to find structural homologues among millions of possible matches, identifying structural domains with great accuracy.
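The DTW step named above can be illustrated with a minimal sketch. It assumes the textbook DTW recursion with an absolute-difference cost and a normalisation by the summed profile lengths; the function names `dot_bracket_to_binary` and `dtw_distance` are hypothetical, and the exact cost function and normalisation used by CROSSalign are not stated in this section.

```python
def dot_bracket_to_binary(structure):
    """Encode a dot-bracket string as 1 (paired) or 0 (unpaired).

    Assumption: any character other than '.' counts as paired.
    """
    return [0 if ch == "." else 1 for ch in structure]


def dtw_distance(a, b):
    """Textbook DTW between two numeric profiles of possibly different lengths.

    Assumption: absolute-difference local cost, accumulated cost divided
    by len(a) + len(b) so that profiles of different sizes are comparable.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j - 1],  # match both positions
                cost[i - 1][j],      # stretch b[j - 1] over several a positions
                cost[i][j - 1],      # stretch a[i - 1] over several b positions
            )
    return cost[n][m] / (n + m)


if __name__ == "__main__":
    short = dot_bracket_to_binary("((..))")
    longer = dot_bracket_to_binary("(((...)))")
    print(dtw_distance(short, longer))
```

Because warping lets one position in the shorter profile absorb several positions in the longer one, two hairpins of different sizes but the same paired/unpaired pattern end up at distance zero in this toy setting.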
To test the performance and functionality of CROSS combined with DTW (Supplementary Figures 1 and 2), we selected a dataset of 22 structures for which crystallographic (exact base pairing between nucleotides) and Selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE; chemical probing of flexible regions used to assess whether a nucleotide is in a double- or single-stranded state) data are available [12]. Using DTW, we calculated the structural distances between all possible pairs in the dataset considering crystallographic (dots and parentheses were transformed into binary code) data as well as 1) SHAPE profiles ( [20] .

We then selected the D2 domain of HOTAIR to measure its conservation in 10 species [21] using CROSSalign (Supplementary Figure 9A). As for XIST, the structural distance analysis indicates that primates cluster close to human, and other species are more distant (Supplementary Figure 9B). Orangutan D2 was then searched for within all human lncRNAs, and HOTAIR was identified as the best match (structural distance 0.032; p-value < 10^-6) with overlapping coordinates (nucleotides: 666-1191; 78% overlap with the query region; Figure 6A). Searching for mouse D2 within all human lncRNAs, HOTAIR was again found as the best match (structural distance 0.092; p-value < 10^-4) at a matching position (nucleotides: 284-788; 57% overlap; Figure 6B). These results suggest that the D2 secondary structure is conserved not only in primates but also in mouse.

To further investigate the secondary structure of HOTAIR, we studied the structural conservation of the D4 domain (Supplementary Figure 10). As opposed to D2, D4 is predicted by CROSS to be poorly structured (Figure 5B). Searching for orangutan D4 within all human lncRNAs yields HOTAIR as the best match (structural distance 0.023; p-value < 10^-6), and the reported sequence position shows a sizeable overlap with the D4 domain in human HOTAIR (predicted coordinates: 1650-2291; overlap of 79%; Figure 7A).
By contrast, when searching for mouse D4 within all human lncRNAs, HOTAIR ranks poorly (1849th; structural distance 0.104; p-value = 0.061), which indicates little structural homology between human and mouse ( Figure Table 3) to identify structurally similar domains. We found that coronavirus HKU and Simian-Human immunodeficiency SIV have the most significant matches with HIV (structural distance 0.078, p-value < 10^-6 for SIV; structural distance 0.093, p-value < 10^-4 for HKU). This finding is particularly relevant since SIV and HIV share many similarities in terms of pathogenicity and evolution [29]. Indeed, previous studies already reported a similarity in secondary structure between HIV and SIV that is not explained by sequence similarity [30]. In addition, we found that the HIV 5' region is structurally similar to a strain of Ebola Figure 8A ). This result indicates that the secondary structure of this region is not only necessary for HIV encapsidation [32], but is also essential for the activity of other viruses.

We also compared structural distances and sequence similarities of all HIV strains (4804; see Materials and Methods). We found two clusters (brown and red; Figure 8B) that are similar in terms of structure (~0.06 structural distance; p-value < 10^-6) and sequence (80%-95% sequence similarity). Other clusters (red and green; Figure 8B) showed a significant difference in structure (structural distances from 0.06 to 0.09; p-value < 10^-6) that is not identifiable by sequence similarity (~85% sequence similarity). This result suggests that HIV could have evolved maintaining a similar sequence but different structures, as previously reported in the literature [30].

author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not peer-reviewed) is the .
https://doi.org/10.1101/284869 doi: bioRxiv preprint

We developed the CROSSalign method based on the combination of the CROSS algorithm to predict the RNA secondary structure at single-nucleotide resolution [10] and the Dynamic Time Warping (DTW) algorithm to align profiles of different lengths [11]. DTW has been previously applied in different fields, especially pattern recognition and data mining [34, 35], but has never been used to investigate structural alignments of RNA molecules. Since CROSS has no sequence length restrictions and shows strong performance on both coding and non-coding RNAs [10], the combination with DTW allows very accurate comparisons of structural profiles. Thermodynamic approaches, such as RNAstructure [14] or RNAfold [12], cannot be directly used for such a task since they are limited in the sequence length they can handle [36]. We applied CROSSalign to investigate the structural conservation of lncRNAs in different species and the complete genomes of ssRNA viruses. We found that the algorithm is able to find structural homologues among thousands of matches and correctly identifies the regions of similarity between profiles of different lengths. The results of our analysis reveal a structural conservation between known lncRNA domains including XIST RepA (best hit out of 8176 cases; 95% overlap with the query region) and HOTAIR D2 (best hit out of 8176 cases; 78% overlap with the

To study how sequence similarity is related to structural similarity, we created different sequences with the same secondary structure as RepA (XIST) and D2 (HOTAIR). To generate the reference structure we used RNAfold [41]. We then generated different sequences encoding the same previously generated structure. For this task we used the command-line tool RNAinverse from the Vienna suite [12]. The tool was launched using standard parameters to generate 50 sequences for each structure.
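Since sequences designed for one target structure all have the length of that structure, their sequence similarity can be compared position by position without an alignment. The sketch below illustrates that idea only; `percent_identity` and `pairwise_identities` are hypothetical helper names, and the similarity measure actually used in the paper is not specified in this section.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percentage of positions carrying the same nucleotide.

    Assumption: both sequences have the same length, as is the case for
    sequences designed to fold into one fixed target structure.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def pairwise_identities(sequences):
    """Map each index pair (i, j), i < j, to its percent identity."""
    return {
        (i, j): percent_identity(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }


if __name__ == "__main__":
    # Toy sequences, not taken from the paper.
    designed = ["GGGAAACCC", "GCGAAAGCC", "CCCUUUGGG"]
    for pair, identity in pairwise_identities(designed).items():
        print(pair, round(identity, 1))
```

Plotting these identities against the corresponding structural distances gives the kind of sequence-versus-structure comparison the paragraph above describes.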
The user should paste one or two RNA sequences in FASTA format into the dedicated form, providing an email address (optional) to receive a notification when the job is completed. The algorithm can be launched in 4 different modes, each of them being a specific variation of the DTW algorithm (Supplementary Figure 1).
• The standard-DTW is recommended for comparing structures of RNAs with similar lengths (i.e., one sequence is less than 3 times longer than the other).
• OBE-DTW (open begins and ends) is a specific mode to search for a shorter profile within a longer one. This is the recommended mode when comparing profiles of very different sizes (i.e., one sequence is more than 3 times longer than the other). Please keep in mind that the sequence entered as RNA input 1 will be searched for within RNA input 2, so the sequence in RNA input 1 should be the shorter of the two.

as a .txt file. The same output is used for the dataset mode, but in this case the table can only be downloaded.

Structural similarities of human RepA against all mouse lincRNAs using double-stranded nucleotides (nucleotides with CROSS score < 0 are set to 0). Human RepA is identified as the best match (colored in red), which highlights the importance of the structural content for the regulatory domains of the lncRNAs. (C) Secondary structure profile of human RepA, obtained as optimal path with OBE-DTW, compared with the best match in mouse lincRNAs (Mirg; ENSMUSG00000097391). The two secondary structure profiles show a strong correlation (0.92).

[Figure residue. Axis labels: Percentage of sequence similarity; Structural distance; Sequence; CROSS score. Annotation: correlation=0.92. Panel titles: orangutan D4 vs human lincRNAs; human D4 vs mouse lincRNAs.]
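The OBE-DTW mode described in the web server instructions above, searching a shorter profile within a longer one, can be sketched as a subsequence DTW in which the match may start and end anywhere in the longer profile. This is a sketch under stated assumptions: the absolute-difference cost, the normalisation by query length, and the names `obe_dtw` and `overlap_percent` are illustrative and not taken from the CROSSalign implementation. The overlap measure (shared length divided by the length of the known region) is likewise only one plausible reading of the overlap percentages reported in the text.

```python
def obe_dtw(query, reference):
    """Open-begin, open-end DTW of a short query inside a longer reference.

    Returns (distance, start, end); the matched region is
    reference[start:end] (0-based, end exclusive).
    Assumption: absolute-difference cost, normalised by len(query).
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # open begin: the match may start at any position
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(query[i - 1] - reference[j - 1])
            # Keep the cheapest predecessor together with its start index.
            prev_cost, prev_start = min(
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
                (cost[i - 1][j], start[i - 1][j]),
                (cost[i][j - 1], start[i][j - 1]),
            )
            cost[i][j] = step + prev_cost
            start[i][j] = prev_start
    # Open end: the match may finish at any position of the reference.
    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end] / n, start[n][end], end


def overlap_percent(pred_start, pred_end, true_start, true_end):
    """Share of the known region [true_start, true_end) that is covered
    by the predicted region [pred_start, pred_end), in percent."""
    shared = max(0, min(pred_end, true_end) - max(pred_start, true_start))
    return 100.0 * shared / (true_end - true_start)


if __name__ == "__main__":
    # Toy binary profiles, not taken from the paper.
    query = [1, 1, 0]
    reference = [0, 0, 0, 1, 1, 0, 0, 0]
    print(obe_dtw(query, reference))
```

Tracking the start index alongside the accumulated cost avoids a separate backtracking pass, which keeps the sketch short; a full implementation would usually also return the warping path itself.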