key: cord-103853-ar09nzmw authors: Wayment-Steele, Hannah K.; Kladwang, Wipapat; Das, Rhiju title: RNA secondary structure packages ranked and improved by high-throughput experiments date: 2020-05-31 journal: bioRxiv DOI: 10.1101/2020.05.29.124511 sha: doc_id: 103853 cord_uid: ar09nzmw The computer-aided study and design of RNA molecules is increasingly prevalent across a range of disciplines, yet little is known about the accuracy of commonly used structure prediction packages in real-world tasks. Here, we evaluate the performance of current packages using EternaBench, a dataset comprising 23 in vitro structure mapping and 11 riboswitch activity datasets involving 18,509 synthetic sequences from the crowdsourced RNA design project Eterna. We find that CONTRAfold and RNAsoft, packages with parameters derived through statistical learning, achieve consistently higher accuracy than more widely used packages like the ViennaRNA software, which derive parameters primarily from thermodynamic experiments. Motivated by these results, we develop a multitask-learning-based model, EternaFold, which demonstrates improved performance that generalizes to diverse external datasets, including complete viral genomes probed in vivo and synthetic designs modeling mRNA vaccines. RNA molecules perform essential roles in cells, including regulating transcription, translation, and molecular interactions, and performing catalysis. 1 Synthetic RNA molecules are gaining increasing interest and development for a variety of applications, including genome editing, 2 biosensing, 3 and vaccination. 4 Characterizing RNA secondary structure, the collection of base pairs present in the molecule, is typically necessary for understanding the function of natural RNA molecules and is of crucial importance for designing better synthetic molecules. Some of the most widely-used packages use a physics-based approach 5 that assigns thermodynamic values to a set of structural features (Vienna RNAfold, 6 NUPACK, 7 and RNAstructure 8 ), with parameters traditionally characterized via optical melting experiments and then generalized by expert intuition. 9 However, a number of other approaches have also been developed that utilize statistical learning methods to derive parameters for structural features (RNAsoft, 10 CONTRAfold, 11 CycleFold, 12 LearnToFold, 13 MXfold 14 , SPOT-RNA 15 ). Secondary structure modeling packages are typically evaluated by comparing single predicted structures to secondary structures of natural RNAs 16 . While important, this practice has limitations for accurately assessing packages, including bias toward structures more abundant in the most well-studied RNAs (tRNAs, ribosomal RNA, etc.) and neglect of energetic effects from these natural RNAs' tertiary contacts or binding partners. Furthermore, scoring on single structures fails to assess the accuracy of ensemble-averaged RNA structural observables, such as base-pairing probabilities, affinities for proteins, and ligand-dependent structural rearrangements, which are particularly relevant for the study and design of riboswitches 17, 18 , ribozymes, pre-mRNA transcripts, and therapeutics 19 that occupy more than one structure as part of their functional cycles. Existing packages are, in theory, capable of predicting ensemble properties through so-called partition function calculations, and, in practice, are used to guide RNA ensemble-based design, despite not being validated for these applications. New sources of structural data now offer opportunities to evaluate secondary structure packages incisively and with less bias. In particular, experimental data accumulating in the RNA design crowdsourcing platform Eterna has potential to mitigate effects of bias incurred in natural RNA datasets, as the probed constructs are designed by citizen scientists to fold to synthetic 3 secondary structures of their choosing. 20 In this work, we evaluate the performance of commonly-used packages capable of making thermodynamic predictions in two tasks that have been crowdsourced on Eterna and are emerging as central to RNA characterization and design: 1) predicting chemical reactivity data through calculating probabilities that nucleotides are unpaired, and 2) predicting relative stabilities of multiple structural states that underlie the functions of riboswitch molecules, a task that involves predicting affinities of both small molecules and proteins of interest. Both of these evaluations are made possible through EternaBench, a collection of high-throughput datasets acquired in Eterna. We find striking, consistent differences in package performance across these quantitative tasks, with the packages CONTRAfold and RNASoft performing better than packages that are in much wider use. Furthermore, we develop a multitask-learning-based framework to train a thermodynamic model on these tasks concurrently with the task of single-structure prediction. The resulting multitask-trained model, called EternaFold, demonstrates increased accuracy both on held-out data from Eterna as well as completely independent datasets encompassing viral genomes and mRNAs probed with distinct methods and under distinct solution and cellular conditions. We conclude with suggestions of how RNA thermodynamic secondary structure models might continue to be improved and with discussion of design applications. We evaluated commonly used secondary structure modeling packages in their ability to make thermodynamic predictions on a compilation of large datasets of diverse synthetic molecules from Eterna, which we termed EternaBench ( Figure 1A ). The packages Vienna RNAfold, NUPACK, RNAstructure, RNAsoft, and CONTRAfold, were analyzed across different package versions, parameter sets, and modelling options, where available (Supplementary Table 2 ). Packages discussed in the main text reflect their standard settings; the performance of other package options are included in Supplementary Information. We also evaluated packages trained more recently through a varied set of statistical or deep learning methods (LearnToFold, SPOT-RNA, MXfold, and CycleFold), but these packages demonstrated poor performance ( Figure S1 , S9) and have been omitted from further discussion. Our first ensemble-based structure prediction task investigates the capability of these packages to predict chemical mapping reactivities. Chemical mapping is a widely-used readout of RNA secondary structure 21-23 and has served as a high-throughput structural readout for experiments performed in the Eterna massive open online laboratory 20 . A nucleotide's reactivity in a chemical mapping experiment depends on the availability of the nucleotide to be chemically modified, and hence provides an ensemble-averaged readout of the nucleotide's deprotection from base pairing or other binding partners. 24 We wished to investigate if current secondary structure packages differed in their ability to recapitulate information about the ensembles of misfolded states that are captured in chemical mapping experiments. We used the Eterna "Cloud Labs" for this purpose: 23 datasets of player-designed constructs, ranging from 80-130 nucleotides in length, all of which were chemically modified by selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) and read out using the MAP-seq chemical mapping protocol. 25 After filtering for sequences with > 80% redundancy, these 23 datasets comprise 11,903 individual constructs ( Figure 1A , "EternaBench-CM"). Figure 2A shows an example heatmap of SHAPE data for Eterna-player-designed synthetic RNA molecules from the Eterna Cloud Lab Round 69, the round with the most 5 constructs and a relatively high signal-to-noise ratio ( Figure S2 ). Figure 2B shows two of these player-designed constructs in the minimum free energy (MFE) structures predicted in the Eterna gameplay with Vienna RNAfold 1.8.5 ("Vienna 1"), colored by experimental SHAPE reactivity data. Vienna 1 predicted that both sequences would form the same structure, and low reactivity in stem regions in construct (i) indicates that the molecule predominantly folded to the target structure. However, high reactivity values in the stems of construct (ii) indicate that the molecule did not fold to the correct structure, and instead remained largely unfolded, in conflict with the predictions of Vienna 1. Figure 2C depicts package-predicted unpaired probabilities per nucleotide, punp, plotted in the same heatmap arrangement as the experimental data in Figure 2A (see Figure S3 for example heatmaps for all package options tested). In this subset of constructs, all packages are largely able to identify which regions should be completely paired (punp = 0, white) or unpaired (punp = 1, black), but some packages are able to predict punp values between 0 and 1 that more Table 3 , and pairwise significance tests are in Figure S4 . We computed the correlation coefficients between chemical mapping data and punp Figure S5 ). For all packages, correlation coefficients were weakly positively associated with number of constructs in the dataset and strongly negatively correlated with the experimental noise in the data ( Figure S2 ), confirming the importance of having large, high-quality data sets in evaluating these secondary structure packages. We observed that CONTRAfold and RNAsoft generally predict that the constructs studied are more melted than the other packages predict at their default temperatures of 37 °C, even though the actual chemical mapping experiments were carried out at lower temperature (24 °C; see Methods). Motivated by this observation, we wished to ascertain if a simple change in temperature might account for differences in performance between packages. To address this, Vienna 2, NUPACK, and RNAstructure packages include parameters for both enthalpy and entropy, allowing for altering the temperature used in prediction. We found that increasing the temperature from the default value of 37 ˚C used in these packages to 60-70 °C improved their correlation to experimental data, but the maximum correlation values did not surpass the correlation observed from CONTRAfold or RNAsoft with default settings ( Figure S6 ). Our second ensemble-based structure prediction task involved predicting the relative populations of states occupied by riboswitch molecules. Riboswitches are RNA molecules that alter their structure upon binding of an input ligand, which effects an output action such as regulating transcription, translation, splicing, or the binding of a reporter molecule. 18, 26, 27 We compared these packages in their ability to predict the relative binding affinity of synthetic riboswitches to their output reporter, fluorescently-tagged MS2 viral coat protein. These riboswitches were designed to use a small molecule input (flavin mononucleotide (FMN), tryptophan, or theophylline) to regulate formation of the MS2 hairpin aptamer ( Figure 1A , "EternaBench-Switch"). These riboswitches came from two sources: the first consisted of 4,404 riboswitches designed by citizen scientists on Eterna, 28 filtered from the original dataset to exclude sequences with over 80% similarity. The second consisted of 2,202 riboswitches designed fully computationally using the RiboLogic package, 29 probed concomitantly with Eterna riboswitches. Three important metrics that secondary structure packages can predict 17 for these datasets (depicted in Figure 3A , overview of experimental measurements in Figure S7 Figure S8 , and correlations and pairwise significance tests for all package options are in Figure S9 and Supplementary Table 4 . When performance across 11 independent experimental datasets was evaluated ( Figure 3C , Figure S10 ) and pairwise comparisons bootstrapped over all datasets, we obtain a ranking, from lowest to highest, of NUPACK, Vienna RNAfold, RNAstructure, RNAsoft Blstar, and CONTRAfold, identical to the ranking obtained from chemical mapping data ( Figure 3C , 3D, pairwise significance tests per dataset in Figure S11 ). Another metric to evaluate predictive accuracy is the root mean-squared 8 error (RMSE) between predicted and experimental values; this metric revealed a similar ranking led by CONTRAfold ( Figure S12 , Supplementary exhibited higher correlation to experimental values than Vienna 2 across the majority of data sets (>99% bootstraps, Figure 3D , Figure S14 ). For log AR, CONTRAfold predictions also exhibited higher correlation than Vienna 2 (87% of bootstraps; Figure 3E , Figure S15 ). Overall, these comparisons consistently ranked CONTRAfold 2 as the most accurate package for modeling riboswitches; this ranking matches the entirely independent ranking based on chemical mapping measurements of distinct RNA sequences described in the previous section. Given that the two packages that performed best in both structure prediction tasks were developed with statistical methods, we hypothesized that performance in these two tasks might be improved by incorporating these tasks in the process of training a secondary structure package. Multi-task learning, the process of learning tasks in parallel using a shared model, has proven useful in image classification and natural language processing. 30 For learning to be transferred across tasks, the tasks must share significant commonalities and structure. However, multi-task learning might fail if modelling assumptions made for one task do not hold across other tasks, or if the model's representational capacity is not large enough to correctly model all the data types. We used the CONTRAfold code as a framework to explore multi-task learning on RNA structural data, since it has previously been extended to train on chemical mapping data to maximize the expected likelihood of chemical mapping data. 31 We further extended the CONTRAfold loss function to include a term to minimize the mean squared error 9 of riboswitch affinities for MS2 protein (Online Methods). We trained models with a variety of combinations of data types to explore interactions in multitask training (Table 1) and evaluated performance on held-out test sets for single-structure prediction accuracy, chemical mapping prediction accuracy, and riboswitch fold change prediction. For single-structure data, we used the S-Processed dataset 32 used previously in training CONTRAfold 2 and RNAsoft 33 . Comparing performance across models trained with different types of input data indicates some tradeoffs in performance. Model "S", trained only on the S-Processed structure dataset, exhibited the highest accuracy for a held-out single-structure prediction test set (0.71 (1)), outperforming models that included training on other data types. Likewise, Model "R", trained only on riboswitch )*# $%&' values, exhibited the highest performance in a held-out riboswitch )*# $%&' prediction test set. However, Model "SCRR", trained on four data types (structure, chemical mapping, riboswitch )*# $%&' and )*# (%&' ) exhibited the highest performance on riboswitch )*# (%&' and chemical mapping, and its performance was within error of Models "S" and "R" on single structure prediction and )*# $%&' prediction test sets, respectively. We termed this SCRR model "EternaFold". We wished to test if EternaFold's improvements in recovering Eterna measurements generalized to improvement in predictions for datasets from other groups, experimental protocols, and RNA molecules. We first tested the ability of EternaFold to predict the thermodynamics of protein binding not included in its training data, analogous to the MS2 binding comparisons above. We made use of a large data set of precisely measured DDG (Table 2 , Figure 4B ). Most of these test molecules were much longer (thousands of nucleotides) than the 85-nucleotide RNAs used as the primary training data for EternaFold. Nevertheless, compared to all other packages tested, EternaFold exhibited the highest correlation in 6/19 datasets with p < 0.01 and an additional 3/19 datasets with p < 0.05 ( Figure 4D , Figure S17 , Supplementary Table 6 , and demonstrated the highest correlation in a pairwise significance analysis bootstrapped over all datasets ( Figure 4C ). We were curious as to whether the differences in packages arise from consistent accuracy differences across all regions of these RNAs or from a net balance of increased and decreased accuracies at specific subregions of the RNAs, which might reflect particular motifs that are handled better or worse by the different packages. Figures In this work, we have established EternaBench, benchmark datasets and analysis methods for evaluating package accuracy for two modeling tasks important in RNA structural characterization and design. These include 1) predicting unpaired probabilities, as measured through chemical mapping experiments, and 2) predicting relative stabilities of different conformational states, as exhibited in riboswitch systems. We discovered that RNAsoft and Further advances in thermodynamic structure prediction might come from incorporating more data in a data-driven framework like EternaFold as well as advances in modelling. To analyze how much accuracy increases from incorporating more data, we compared differences in accuracy in models trained on the holdout dataset, roughly 40% the size of the full training dataset ( Figure S20 ). Figure S21 ). Further data-driven investigations will be necessary to improve performance and aspects of the model that need to expanded, which may include noncanonical pairs 43 , more sophisticated treatment of junctions 44 , next-nearest-neighbor effects 14 , and chemically modified nucleotides 45, 46 . This work demonstrates that RNA secondary structure prediction can be methodically evaluated and improved by combining thermodynamic predictions with statistical learning on diverse, high-throughput experimental datasets. As more sources of structural data continue to be made available in higher quantity, resolution, and scope, the EternaBench and EternaFold frameworks will allow for the usage of these datasets in continually assessing and improving RNA structure prediction. An important use case for EternaFold will be computationally guided design of RNA medicines, including structured mRNAs 19 for viral pandemics like COVID-19. EternaFold already affords increases in predictive power for structure mapping data acquired for synthetic mRNAs that model mRNA vaccines as well as for alphaviruses, which form the basis of self-amplifying mRNA vaccines 47, 48 . These observations suggest immediate applications for the packages herein and potentially rapid further improvements in EternaFold as further structural data are acquired on these new RNA molecules. In Vitro/ In Vivo chemical mapping The algorithms evaluated in this work model secondary structure in the following manner. Given a model , which is comprised of a set of structural features { }, the partition function of an RNA sequence x is computed as where ( & ) is the free energy contribution of structural feature k, -is Boltzmann's constant, and T is temperature. Z represents a sum over the set of all possible structures {S}. From this expression, the probability of any particular structure s is defined as . (2) Structure prediction algorithms are able to estimate the ensemble-averaged probability that a nucleotide is paired or unpaired. Let ( : | , ) be the probability of bases i and j being paired, given sequence x and model . This may be computed as The relationship between the probability of a nucleotide being unpaired and its experimentallymeasured reactivity has served as a locus for many efforts for improving structure prediction of RNA constructs incorporating chemical mapping data from those constructs, and several 2 functional forms have been used to describe the relationship between unpaired probability and chemical mapping reactivity [1] [2] [3] . In this work, we use the correlation coefficient between unpaired probability and experimentally-measured reactivity as a nonparametric measure of model quality. A thermodynamic framework discussed in greater detail in ref. 4 allows us to relate the observed binding affinity of an output molecule to the relative populations of a riboswitch molecule in different states. In the absence of input ligand, we may relate the probability that a riboswitch adopts a structural feature that can bind its output, ( ), to an experimentally-measured binding affinity, K 456 .718 , via the relative ratios of both values to those of a reference state: We selected the MS2 hairpin aptamer as a reference state whose probability of forming, @AB ( ), can be estimated by the secondary structure algorithm. For each separate independent experimental dataset, ;C) @AB is estimated as the highest affinity measured ( Figure S7 ). We refer to the estimated ratio Figure 3A ). Although there may be error introduced in which experimental point is selected to be ;C) @AB , relative error should be constant when comparing packages on the same dataset. To compare packages, we report the correlation between log( ;C) ±718 / ;C) @AB ) and log( >+? ±718 ), which excludes the effect of selection for K 456 @AB . However, the Root Mean Squared Error (RMSE) between predicted and experimental values is also of interest to be able to consider error in terms of 3 energetic units (kcal/mol) and is reported in Supplementary tables S4, S5, S6, S7, and plotted in Figure S11 for the Ribologic-FMN dataset. In general, the probability of any an RNA molecule forming any structure motif is computed as where E;=1B denotes a structure containing that motif. Computing this probability requires a dynamic programming routine that is able to constrain the sampled structure space to only structures containing that motif to estimate a so-called "constrained partition function". However, not all secondary structure algorithms have implemented constrained partition function estimation. Because the MS2 aptamer is a hairpin, we can approximate its probability of forming as the probability of forming the final base pair of the MS2 hairpin aptamer (colored pink in Figure 3A ), an experimental observable that can be estimated by all the packages tested here. Thus, our prediction of interest is where i and j are the nucleotides forming the terminal base pair in the MS2 aptamer stem. The value @AB ( : ) is accordingly computed as the probability of closing the base pair in the reference sequence. We confirmed that calculations using eqn. 5 and eqn. 7 agree for Vienna and CONTRAfold packages. RNAstructure is capable of computing constrained partition functions, but the constrained partition function calculations did not match those from base pair probabilities ( Figure S13 ). Hence, RNAstructure constrained partition function predictions were excluded from comparisons. The estimation of K G4HI J718 follows similarly to above but must take into account increased thermodynamic weights for states that correctly display the aptamer of the input small molecule 4 ligand. Therefore, it cannot be estimated via the simplified single base pair calculation and must make use of constrained partition functions (eq. 6). For all constructs as well as the reference MS2 hairpin construct, we performed >+? ± 718 estimations including a flanking hairpin included in the Illumina array experiments (described in ref. 5 ). In example, the full reference MS2 hairpin construct, as well as the constraint used for estimating :@AK @AB with constrained-partition-function-based estimation, is In brief, the CONTRAfold 7 loss function optimizes the conditional log-likelihood of ground-truth structure (1) given sequence (1) over dataset : In CONTRAfold-SE 8 , the authors include a term to also use chemical mapping data to optimize structure prediction by maximizing the likelihood of observing the included chemical 6 The datasets used here for evaluation, as well as scripts and Python notebooks for reproducing the filtered datasets and the chemical mapping and riboswitch fold change calculations described here, are available at www.software.eternagame.org in the package "EternaBench". The code for training EternaFold, as well as the training and test sets used, are available at www.software.eternagame.org as the package "EternaFold". The EternaFold code is derived from the CONTRAfold-SE 8 codebase, which is derived from the CONTRAfold 7 codebase. A server to run EternaFold is being made available for noncommercial use. All base-pairing probability calculations and constrained partition function calculations were performed using standardized system calls through Python wrappers developed in Arnie (www.github.com/DasLab/arnie). Example command-line calls for each package option evaluated are provided in Supplementary Table 2 . Datasets were handled with Pandas (https://github.com/pandas-dev/pandas) and visualized with Seaborn (https://seaborn.pydata.org/). The following significance test was performed to make pairwise package comparisons for each type of data. For each bootstrapping round, the datasets within the data type (for instance, the 23 Cloud Lab chemical mapping rounds) were sampled with replacement, and within each of the sampled datasets, the datapoints were sampled with replacement. The correlations for both packages on the sampled datasets were calculated, and the number of datasets for which package A had a higher correlation than package B was recorded. If package A had a higher correlation in more than 50% of the datasets, this was counted as a "win" for package A. and processed with RDATKit (https://ribokit.github.io/RDATKit/). The RNA was probed with the MAP-seq protocol with a co-loaded standard molecule (P4-P6-2HP RNA) to enable normalization, as described in ref. 10 ; measurements were carried out at ambient temperatures (24 ˚C) with 10 mM MgCl2, 50 mM Na-HEPES, pH 8.0. Within each chemical mapping dataset, CD-HIT-EST 11 was used to filter sequences with greater than 80% redundancy (excluding a shared 3' primer binding site). From each sequence cluster identified, the sequence with the highest signal-to-noise ratio from chemical mapping experiments was selected as the representative sequence. Nucleotides with reactivities less than zero or greater than the 95th percentile of the dataset were removed from analysis. Cloud Lab Round 70 was filtered to exclude certain experiments that had FMN present, pertaining to Eterna Cloud Lab challenges to design riboswitches. Adenosine nucleotides preceded by 6 or more As were also removed due to evidence of anomalous transcription effects in such stretches 12 , though this removal was not shown to alter package correlations to data (data not included). External chemical mapping datasets were obtained from the supplementary information from the papers and processed similarly (outliers, nucleotides in poly-A stretches removed). For molecules longer than 600 nucleotides, p(unp) predictions were performed using a sliding window size of 600, with overlapping regions of length 25. Changing the window size was not shown to affect correlation values ( Figure S18 ). The eukaryotic genome as an RNA machine Exploring the potential of genome editing CRISPR-Cas9 technology RNA-Based Fluorescent Biosensors for Detecting Metabolites in vitro and in Living Cells Introduction to RNA Vaccines Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information ViennaRNA Package 2.0 NUPACK: Analysis and design of nucleic acid systems RNAstructure: software for RNA secondary structure prediction and analysis Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs RNA secondary structure prediction without physics-based models Measurements were carried out at 37 °C in 100 mM Tris-HCl, pH 7.5, 80 mM KCl, 4 mM MgCl2 mg/mL BSA, 1 mM DTT, 0.01 mg/mL yeast tRNA, 0.01% Tween-20, and varying concentrations of small molecule ligand (FMN, tryptophan, theophylline) and MS2 coat protein Datasets were filtered to only include constructs with sequences that included the canonical MS2 and small molecule aptamers, and filtered using CD-HIT-EST 11 to remove sequence redundancy over 80% RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction Data-directed RNA secondary structure prediction using probabilistic modeling Computational Analysis of Conserved RNA Secondary Structure in Transcriptomes and Genomes Evaluating riboswitch optimality Crowdsourced RNA design discovers diverse, reversible, efficient, self-contained molecular sensors. bioRxiv Automated Design of Diverse Stand-Alone Riboswitches RNA secondary structure prediction without physics-based models Learning RNA secondary structure (only) from structure probing data An RNA Mapping DataBase for curating RNA structure mapping experiments Standardization of RNA Chemical Mapping Experiments CD-HIT: accelerated for clustering the nextgeneration sequencing data Anomalous reverse transcription through chemical modifications in polyadenosine stretches. bioRxiv We thank members of the Das and Barna labs (Stanford University), C. Pop, and C.-S. Foo for useful discussions. We thank I. Jarmoskaite, V. V. Topkar, R. Wellington-Oguri, and J. Townley for helpful comments on the manuscript. Calculations and model training were performed on the