key: cord-102892-nt1zoktv
authors: Sweeney, Blake A.; Hoksza, David; Nawrocki, Eric P.; Ribas, Carlos Eduardo; Madeira, Fábio; Cannone, Jamie J.; Gutell, Robin; Maddala, Aparna; Meade, Caeden; Williams, Loren Dean; Petrov, Anton S.; Chan, Patricia P.; Lowe, Todd M.; Finn, Robert D.; Petrov, Anton I.
title: R2DT: computational framework for template-based RNA secondary structure visualisation across non-coding RNA types
date: 2020-09-11
journal: bioRxiv
DOI: 10.1101/2020.09.10.290924
sha: 
doc_id: 102892
cord_uid: nt1zoktv

Non-coding RNAs (ncRNA) are essential for all life, and the functions of many ncRNAs depend on their secondary (2D) and tertiary (3D) structure. Despite proliferation of 2D visualisation software, there is a lack of methods for automatically generating 2D representations in consistent, reproducible, and recognisable layouts, making them difficult to construct, compare and analyse. Here we present R2DT, a comprehensive method for visualising a wide range of RNA structures in standardised layouts. R2DT is based on a library of 3,632 templates representing the majority of known structured RNAs, from small RNAs to the large subunit ribosomal RNA. R2DT has been applied to ncRNA sequences from the RNAcentral database and produced >13 million diagrams, creating the world’s largest RNA 2D structure dataset. The software is freely available at https://github.com/rnacentral/R2DT and a web server is found at https://rnacentral.org/r2dt.

Introduction

…based template are available for the same RNA, the 3D-based template is preferentially selected.

The Ribovore software is used to search against all models except for tRNA. If no hits are detected, tRNAscan-SE 2.0 is then used to compare the sequences against the bacterial, archaeal, and eukaryotic domain-specific tRNA models. Once a top scoring domain-specific tRNA model is chosen, the sequence is compared with the isotype-specific tRNA models for that domain.

2. The input sequence is folded with the Infernal cmalign program using the top scoring covariance model. This ensures that the predicted 2D structure is compatible with the template 2D structure. It is important to note that R2DT does not attempt to fold the unstructured regions found in some templates or predict the structure of the insertions relative to the template.

For each sequence, the pipeline produces a text file with the 2D structure in dot-bracket notation and a 2D diagram in SVG format. The diagrams are coloured depending on the identity of the individual nucleotides in the input sequence relative to the template. Identical nucleotides are shown in black, while inserted nucleotides are displayed in red. If a nucleotide is modified compared to the template reference sequence, it is shown in green. If the location of a nucleotide was automatically repositioned relative to its corresponding position in the template, the nucleotide is coloured blue.
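To illustrate the dot-bracket output mentioned above, the following minimal Python sketch recovers base pairs from such a string. It assumes that only round brackets denote pairs and that every other symbol marks an unpaired nucleotide. The function name `parse_dot_bracket` is hypothetical and is not part of R2DT.

```python
from typing import List, Tuple


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) index pairs for every matched '(' and ')'."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((stack.pop(), index))
        # Assumption: any other symbol (for example '.') is unpaired.
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)
```

For a hairpin written as `((..))`, the function pairs position 0 with 5 and position 1 with 4.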

The SVG diagrams can be scaled to any resolution and edited using text editors or specialised vector graphics editing software. When viewed with a web browser, additional information is shown when hovering the mouse over individual nucleotides (for example, hovering over modified nucleotides reveals the identity of the nucleotide in the corresponding position of the reference sequence). Further interactivity can be added to the SVG visualisations using JavaScript and CSS web technologies.
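One general way to attach hover text to SVG elements is a `<title>` child, which most browsers display as a tooltip. The Python sketch below adds such children to `<text>` elements. The element ids and tooltip strings are hypothetical, and the sketch does not claim to reproduce how R2DT itself encodes this information.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SVG_NS = "http://www.w3.org/2000/svg"
# Serialise SVG elements without a namespace prefix.
ET.register_namespace("", SVG_NS)


def add_tooltips(svg_text: str, tooltips: Dict[str, str]) -> str:
    """Attach a <title> child to each <text> element whose id is in `tooltips`."""
    root = ET.fromstring(svg_text)
    for element in root.iter(f"{{{SVG_NS}}}text"):
        label = tooltips.get(element.get("id"))
        if label is not None:
            title = ET.SubElement(element, f"{{{SVG_NS}}}title")
            title.text = label
    return ET.tostring(root, encoding="unicode")
```

A call such as `add_tooltips(svg, {"n2": "template: U"})` leaves all other nucleotides untouched and annotates only the element with id `n2`.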

Comprehensive 2D structure template library

We compiled a library of 3,632 templates aggregating RNA 2D structure layouts from different sources (Table 1). While the majority of the 3,632 templates were integrated from the existing sources (Table 1), 103 templates have been manually curated specifically for this project, as described below (also see Supplementary Table 1).

New 3D structure based templates model rRNA expansion segments

The availability of the experimentally determined ribosomal 3D structures enabled us to improve the traditional rRNA diagrams available from the CRW [2,19]. Specifically, the 3D structural data were used to assess the accuracy of the covariation-based 16S and 23S rRNA secondary structures, remove the few incorrect base pairs, add new base pairs with both Watson-Crick and non-canonical base pair conformations, and provide detailed modelling of the species-specific expansion segments that were not present in the covariation-based expansion segments. The revised LSU 2D templates are outlined using single page layouts and explicitly depict H26a [20], a helix that connects the 5′ and 3′ halves of the LSU rRNA. This irregular helix, which is now known to be the loop-E motif [21], was initially suggested by Gutell and Fox [22], and had been indicated by arrows connecting the two halves of the historical LSU rRNA layouts [23]. All non-canonical interactions were explicitly depicted when the first 3D structural model of the LSU particle became available [24].

The single page LSU layouts enable R2DT to visualise the LSU 2D structures automatically, which has not been possible until now (Figure 4a …

… structures, we prepared 68 isotype-specific templates for bacterial, archaeal, and eukaryotic tRNAs that include those decoding the standard twenty amino acids, initiator methionine/N-formylmethionine (tRNA iMet in archaea/eukaryotes or tRNA fMet in bacteria), isoleucine for the AUA codon in bacteria and archaea, and selenocysteine (Figure 5). A consensus tRNA primary sequence with 2D structure for each isotype of each taxonomic domain was generated based on the tRNA alignments used for building the isotype-specific covariance models in tRNAscan-SE 2.0 [16]. The isotype-specific tRNA 2D structure templates were created using the corresponding consensus sequences and structures. In addition, we generated six domain-specific templates for more general application. Owing to the structural difference in the variable loop between type I and type II tRNAs [28], the alignments for building the domain-specific covariance models in tRNAscan-SE 2.0 [16] were divided into two sets. Similar to the isotype-specific ones, the domain-specific templates were built with the consensus sequences and structures for both type categories of tRNAs. Together, the isotype-specific templates can be used to visualise 2D structures of tRNAs with typical features, while the domain-specific templates can be applied to the atypical predictions with undetermined or inconsistent isotypes.

The R2DT pipeline is designed to be extendable as new templates are added to the library.

Notably, R2DT can also serve as a tool for the development of new templates where the R2DT output is used as a starting point for manual refinement of the 2D layouts. To facilitate the workflow, we provide a modified version of the XRNA software [31] called XRNA-GT that supports the import of the R2DT-generated SVG files and can be used to adjust the 2D layouts (for example, change the orientation of RNA helices or edit base pairs). Using XRNA-GT it is also possible to add custom annotations, such as helix or nucleotide numbers, in order to produce publication-ready images. The updated 2D layouts can be submitted to the R2DT library where they become new templates, upon review by the R2DT team. This workflow has been successfully used internally to produce the 3D-based SSU templates. We welcome new contributions from the community and provide detailed documentation on GitHub (https://github.com/RNAcentral/R2DT#how-to-add-new-templates).

Validation of 2D diagrams

At the time of writing, there are no alternative methods that enable template-based RNA 2D structure visualisation at a comparable scale. The only related method, implemented in rPredictorDB [32], has a small number of templates (56 as of July 2020) and limited support for alternative templates from the same RNA type (for example, species-specific rRNA templates).

As this is a unique dataset, we developed global benchmarks to assess both the accuracy of the template selection and the quality of the resulting 2D diagrams.

Evaluation of template selection

We tested R2DT with a diverse set of rRNA sequences to evaluate the template selection process, focusing on the rRNA templates as they are annotated at the species level, making it possible to compare the taxonomic lineages of the input sequence and the template. We selected all rRNA sequences from RefSeq [33] shorter than 10,000 nucleotides (23,843 sequences as of July 2020). The sequences were visualised with R2DT and the taxonomic trees of the sequences and the selected templates were compared by identifying the most specific taxonomic rank common to the templates and the RefSeq sequences. For example, if an rRNA from Photorhabdus caribbeanensis was drawn using a template from Escherichia coli, their respective phylogenies share the order Enterobacteriales, thus the sequence and the template agree at the level of order. The majority of sequences match the templates at the level of kingdom (55.5%), phylum (20.0%), or class (16.1%) (Supplementary Table 2), indicating that the selected templates can be taxonomically distant from the input sequences. This effect is due to the preferential use of the 3D-based SSU and LSU rRNA templates, as only a relatively small number of 3D structures is available. However, when we classified each nucleotide in the 2D diagrams based on whether it matched a template for each taxonomic rank separately, we found that at least 94% of all nucleotides were in the same position as the template for all taxonomic ranks, confirming that the sequences closely matched the selected templates despite the phylogenetic distance between the template and sequence.
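The rank comparison described above can be sketched in Python as follows. The rank list, the function name and the hand-written lineages are illustrative assumptions; only the shared order, Enterobacteriales, is taken from the text, and the remaining lineage entries are written from general knowledge rather than from the benchmark data.

```python
from typing import Dict, Optional

# Ranks ordered from least to most specific (assumed ordering for this sketch).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def most_specific_shared_rank(
    lineage_a: Dict[str, str], lineage_b: Dict[str, str]
) -> Optional[str]:
    """Walk down the ranks and return the last one at which both lineages agree."""
    shared = None
    for rank in RANKS:
        name_a, name_b = lineage_a.get(rank), lineage_b.get(rank)
        if name_a is None or name_a != name_b:
            break
        shared = rank
    return shared


# Illustrative lineages written by hand for this example.
photorhabdus = {
    "kingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacteriales",
    "family": "Morganellaceae",
    "genus": "Photorhabdus",
}
escherichia = {
    "kingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacteriales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
}
```

With these inputs, `most_specific_shared_rank(photorhabdus, escherichia)` stops at the family level and reports `order` as the deepest shared rank.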
R2DT templates model the conserved core of most structured RNAs

We classified each nucleotide in the resulting diagrams according to whether it matched a template and found that 90.6% of nucleotides were displayed using the nucleotide locations encoded in the templates, while 6.0% of nucleotides represented insertions compared to the templates, and 3.4% of nucleotides matched the templates but required automatic repositioning by the Traveler software (Table 2). Overall, 94.0% of the nucleotides were visualised using the templates. To further confirm the agreement between the templates and the diagrams, we manually inspected 1,043 2D diagrams from human and E. coli (based on the HGNC and EcoCyc sequences) to identify any modes of failure, such as overlapping structural regions.

This process identified only 24 suboptimal diagrams (2.3%) that were characterised by overlapping helices and other artifacts (all diagrams can be seen in Supplementary Information), while the remaining 1,019 (97.7%) diagrams were error-free, indicating a close correspondence between the template and rendered sequence.
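The percentages reported above are mutually consistent, which the short Python check below makes explicit. The category labels and the helper `category_percentages` are hypothetical names chosen for this sketch; the numbers are those stated in the text.

```python
from collections import Counter
from typing import Dict, Iterable


def category_percentages(labels: Iterable[str]) -> Dict[str, float]:
    """Convert per-nucleotide category labels into percentages of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no nucleotides to classify")
    return {category: 100.0 * count / total for category, count in counts.items()}


# Percentages as reported in the text.
reported = {"template": 90.6, "inserted": 6.0, "repositioned": 3.4}

# Nucleotides drawn with the help of a template: template positions plus repositioned ones.
template_based = reported["template"] + reported["repositioned"]

# Manual inspection: 24 suboptimal and 1,019 error-free diagrams out of 1,043.
suboptimal = 100.0 * 24 / 1043
error_free = 100.0 * 1019 / 1043
```

Rounded to one decimal place, `template_based` corresponds to the overall 94.0% figure, and `suboptimal` and `error_free` correspond to the 2.3% and 97.7% figures.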

To eliminate bias from the use of model organisms (which tend to have the most experimental data), and to also demonstrate the scalability of R2DT, the nucleotide classification analysis was extended to a broad range of sequences from a wide taxonomic distribution by processing all ncRNA sequences from RNAcentral, aiming to test as many realistic use cases as possible. …

We present a comprehensive framework for the ongoing development of consistent, standardised visualisations of RNA 2D structures. As new 2D structure templates are introduced, the pipeline can be extended to cover new RNA types, including structured viral RNAs. For example, as the Coronavirus-specific RNA families were added to the Rfam database in response to the COVID-19 pandemic [43], their 2D structures were included in the template library to enable consistent visualisation of SARS-CoV-2 structured RNAs (Figure 6), such as the 5′ and 3′ UTRs and frameshifting signal (Rfam accessions RF03120, RF03125, and RF00507, respectively).

The 2D structure diagrams produced by the pipeline represent computational predictions. …

R2DT establishes a framework that can be further extended and refined. Importantly, R2DT can be used to generate starting versions of new templates that can be manually refined and incorporated into the template library. For example, new rRNA sequences can be submitted to R2DT, the species-specific expansion segment regions can be manually edited, and the resulting diagram can be submitted to R2DT as described above.

In addition, we identified two areas for future development and improvements: 1) Expanding and refining the template library. As new detailed 2D structures are published, we will integrate them as templates into the R2DT library. In addition, R2DT will benefit from the ongoing development …

Isotype-specific consensus tRNA sequences and 2D structures were generated using R-scape [52] from the alignments that were used to train and build the corresponding covariance models in tRNAscan-SE [16]. Alignments for training the domain-specific covariance models were split into two subsets: 1) type I tRNAs (all except type II), and 2) type II tRNAs (leucine and serine in bacteria, archaea and eukaryotes, and tyrosine in bacteria). The bacterial tRNA alignments were further filtered to include only one representative tRNA with the same anticodon in each genus, because the original training set was very large (over 73,000 tRNAs). Consensus sequences and the 2D structures of type I and II tRNAs for each domain were then generated using R-scape [52], in the same way as for the isotype-specific ones. R2R [9] was used for the initial image creation using the consensus sequence. Custom adjustments were then made to convert the positions of the images into the typical tRNA cloverleaf structure orientation. The templates correspond to tRNAscan-SE 2.0 covariance models that are used to score input sequences against each isotype-specific set and pick the highest scoring domain/template combination. The pseudogene tRNAs, as identified by tRNAscan-SE 2.0, are not currently visualised.
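The split of tRNAs into type I and type II described above can be written as a small lookup. In this Python sketch the three-letter isotype codes, the domain keys and the function name are assumptions made for illustration; the assignment of leucine, serine and bacterial tyrosine to type II follows the text.

```python
# Type II isotypes per domain, as listed in the text (three-letter codes are assumed labels).
TYPE_II_ISOTYPES = {
    "bacteria": {"Leu", "Ser", "Tyr"},
    "archaea": {"Leu", "Ser"},
    "eukaryotes": {"Leu", "Ser"},
}


def trna_type(domain: str, isotype: str) -> str:
    """Return "II" for isotypes listed as type II in the given domain, else "I"."""
    if domain not in TYPE_II_ISOTYPES:
        raise ValueError(f"unknown domain: {domain}")
    return "II" if isotype in TYPE_II_ISOTYPES[domain] else "I"


# Every domain paired with both types.
DOMAIN_TYPE_COMBINATIONS = [
    (domain, tRNA_type) for domain in TYPE_II_ISOTYPES for tRNA_type in ("I", "II")
]
```

Three domains and two types give six combinations, which agrees with the six domain-specific templates.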

Rfam 2D structure templates

For RNA families without a standard, community-accepted 2D structure layout, we adopted the Rfam consensus 2D structures displayed using the R-scape [52] and R2R [9] software. The R2R software uses a set of rules that lead to consistent diagrams with the standard position of the 5′ and 3′ ends of the sequence. We excluded the lncRNA Rfam families, as well as families that …

Each sequence's best-matching model is used in the second round of cmsearch, executed with the "--hmmonly" option, which again uses a profile HMM to score sequence only, but this time executes the full HMMER3 filter pipeline so that accurate hit boundaries in sequence and model coordinates are reported. While the second round of cmsearch is slower per model/sequence comparison than the first, only one model is compared to each sequence instead of all models. If the second cmsearch round identifies multiple hits to the model, this indicates that at least some of the input sequence (the intervening sequence between adjacent hits) is either inserted relative to the model, or dissimilar from the expected homologous model region. In this case, the sequence is not evaluated further and no structure diagram will be drawn for the sequence.

Typically, profile HMMs and covariance models are built from multiple sequence alignments, but the SSU and LSU rRNA models used in R2DT were built from the single sequence templates. R2DT uses the Rfam covariance models built from the Rfam seed alignments. If, for a given sequence, the first round of ribotyper.pl cmsearch results in zero models with a score above 20 bits, indicating that no significant similarity has been detected to any models, then the second cmsearch round is skipped and the sequence will be analysed in a subsequent step by tRNAscan-SE 2.0 to identify possible similarity against the tRNA models.
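The decision logic of the two search rounds can be summarised in a hedged Python sketch. The function names, the return labels and the callback used to count second-round hits are hypothetical; only the 20-bit threshold, the fallback to tRNAscan-SE 2.0 and the rejection of sequences with multiple hits come from the text. The case of zero second-round hits is not described in the text, so treating it as a rejection is an assumption.

```python
from typing import Callable, Dict, Optional, Tuple

SCORE_THRESHOLD_BITS = 20.0  # first-round threshold stated in the text


def best_model(first_round_scores: Dict[str, float]) -> Optional[str]:
    """Return the top-scoring model above the threshold, or None if there is none."""
    passing = {m: s for m, s in first_round_scores.items() if s > SCORE_THRESHOLD_BITS}
    if not passing:
        return None
    return max(passing, key=passing.get)


def decide(
    first_round_scores: Dict[str, float],
    count_second_round_hits: Callable[[str], int],
) -> Tuple[str, Optional[str]]:
    """Return ("trnascan", None), ("skip", model) or ("draw", model)."""
    model = best_model(first_round_scores)
    if model is None:
        # No model scored above the threshold: hand the sequence to tRNAscan-SE 2.0.
        return ("trnascan", None)
    hits = count_second_round_hits(model)
    if hits != 1:
        # Multiple hits: no diagram is drawn (zero hits handled the same way by assumption).
        return ("skip", model)
    return ("draw", model)
```

For example, `decide({"a": 50.0, "b": 30.0}, lambda model: 1)` selects model `a` for drawing, whereas scores of 10 and 15 bits send the sequence to the tRNA step.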

Visualising 2D structures using Traveler

To produce a layout for an input (target) structure, the Traveler software [17] requires the target and template 2D structures accompanied by the template layout. Both the target and template structures are turned into a tree-based representation; a minimum mapping between the trees is then found and the template layout is modified based on this mapping to fit the target structure. To support the R2DT pipeline, two major modifications were made to the Traveler software: i) the ability to provide custom mapping and ii) optimised hairpin rotation.
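A tree-based representation of a 2D structure can be sketched in Python as below. This is one common encoding, in which base pairs become internal nodes and unpaired nucleotides become leaves; the exact tree used inside Traveler may differ, and the `Node` class and function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    left: Optional[int]           # index of the 5' nucleotide (None for the root)
    right: Optional[int] = None   # index of the 3' partner (None if unpaired or root)
    children: List["Node"] = field(default_factory=list)


def dot_bracket_to_tree(structure: str) -> Node:
    """Build a tree in which each base pair encloses the nucleotides nested inside it."""
    root = Node(left=None)
    stack = [root]
    for index, symbol in enumerate(structure):
        if symbol == "(":
            node = Node(left=index)
            stack[-1].children.append(node)
            stack.append(node)
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at position {index}")
            stack.pop().right = index
        else:
            # Assumption: every other symbol is an unpaired nucleotide (a leaf).
            stack[-1].children.append(Node(left=index))
    if len(stack) != 1:
        raise ValueError("unmatched '(' in structure")
    return root
```

For the structure `((..)).`, the root has two children: the outer base pair (0, 5), which contains the inner pair (1, 4) with two unpaired leaves, and the trailing unpaired nucleotide at position 6.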

Since the target 2D structure is generated by Infernal within the R2DT pipeline, the target-template structure mapping is already known and the original Traveler's mapping procedure is not needed. Therefore, for the purpose of R2DT, a new process was implemented that uses the Infernal output with the target-template sequence mapping and produces an Infernal-informed tree mapping which is used by Traveler.

Although in most cases the resulting layout is overlap-free, sometimes the target and template differ in such a way that it is not easily possible to fit the target-specific portions of the structure into the template. Therefore, a new overlap detection process was implemented in Traveler that allows the overlapping parts of the structure to be rotated so that the number of overlaps is …

The R2DT source code is available on GitHub under the Apache 2.0 License (https://github.com/rnacentral/R2DT). An R2DT web server can be found at https://rnacentral.org/r2dt and its source code is available at https://github.com/RNAcentral/r2dt-web. A custom version of XRNA-GT is available at https://github.com/LDWLab/XRNA-GT.

… tRNAscan-SE 2.0 software, and wrote the manuscript. RDF coordinated the project and wrote the manuscript. AIP conceived and implemented the R2DT software, wrote the manuscript, and coordinated the project.

Predicting and modeling RNA architecture

The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs

VARNA: Interactive drawing and editing of the RNA secondary structure

Forna (force-directed RNA): Simple and effective online RNA secondary structure diagrams

Tools for the automatic identification and classification of RNA base pairs

PseudoViewer: web application and web service for visualizing RNA pseudoknots and secondary structures

R2R--software to speed the depiction of aesthetic consensus RNA secondary structures

RNA2Drawer: geometrically strict drawing of nucleic acid structures with graphical structure editing and highlighting of complementary subsequences

Web servers for RNA secondary structure prediction and analysis

Mfold web server for nucleic acid folding and hybridization prediction

Structural RNA homology search and alignment using covariance models

RNAcentral: a hub of information for non-coding RNA sequences

Infernal 1.1: 100-fold faster RNA homology searches

tRNAscan-SE 2.0: Improved Detection and Functional Classification of Transfer RNA Genes

TRAVeLer: a tool for template-based RNA secondary structure

Secondary structure and domain architecture of the 23S and 5S rRNAs

A common motif organizes the structure of multi-helix loops in 16 S and 23 S ribosomal RNAs

Additional Watson-Crick interactions suggest a structural core in large subunit ribosomal RNA

Secondary structure model for 23S ribosomal RNA

The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution

Evolutionary Characteristics of 16S and 23S rRNA Structures

rPredictorDB: a predictive database of individual secondary structures of RNAs and their formatted plots

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

DictyBase 2013: integrating multiple Dictyostelid species

FlyBase 2.0: the next generation

2018: knowledgebase for the laboratory mouse

PomBase 2015: updates to the fission yeast database

Saccharomyces Genome Database: the genomics resource of budding yeast

The Arabidopsis information resource: Making and mining the 'gold standard' annotated reference plant genome

WormBase 2012: more genomes, more data, new website

Genenames.org: the HGNC and VGNC resources in 2017

XRNA: Auto-interactive program for modeling RNA. The Center for Molecular Biology of RNA

/pub/XRNA

Secondary structures of rRNAs from all three domains of life

Nonredundant 3D Structure Datasets for RNA Knowledge Extraction and Benchmarking

The Protein Data Bank

FR3D: finding local and composite recurrent structural motifs in RNA 3D structures

RiboVision suite for visualization and analysis of ribosomes. Faraday Discuss

A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs

Accelerated Profile HMM Searches

DNA homology search with profile HMMs

The authors would like to thank the RNAcentral Consortium for contributing data to RNAcentral

The authors declare no competing interests.