title: On an enhancement of RNA probing data using information theory
authors: Li, Thomas J. X.; Reidys, Christian M.
date: 2020-08-07
journal: Algorithms Mol Biol
DOI: 10.1186/s13015-020-00176-z

Identifying the secondary structure of an RNA is crucial for understanding its diverse regulatory functions. This paper focuses on how to enhance target identification in a Boltzmann ensemble of structures via chemical probing data. We employ an information-theoretic approach to solve the problem, considering a variant of the Rényi-Ulam game. Our framework is centered around the ensemble tree, a hierarchical bi-partition of the input ensemble, which is constructed by recursively querying whether or not a base pair of maximum information entropy is contained in the target. These queries are answered by relating local with global probing data, employing the modularity of RNA secondary structures. We show that the leaves of the tree are comprised of sub-samples exhibiting a distinguished structure with high probability. In particular, for a Boltzmann ensemble incorporating probing data, which is well established in the literature, the probability of our framework correctly identifying the target in the leaf is greater than 90%.

Computational methods for RNA secondary structure prediction have played an important role in unveiling the various regulatory functions of RNA. In the past four decades, these approaches have evolved from predicting a single minimum free energy (MFE) structure [1, 2] to Boltzmann sampling an ensemble of possible structures [3, 4]. Despite their success on a wide range of small RNAs, these thermodynamics-based predictions are by no means perfect. In parallel, experiments by means of chemical and enzymatic probing have become a frequently used technology to elucidate RNA structure [5-7]. The basic idea of these probing methods is to use chemical probes that react differently with paired or unpaired nucleotides. The binding sites can later be detected by biochemical techniques, such as selective 2′-hydroxyl acylation with primer extension (SHAPE) [7, 8], which yield reactivities at nucleotide resolution. To some extent, these reactivities provide information concerning single-stranded or double-stranded RNA regions. However, the reactivity does not unambiguously determine a specific position to be unpaired or paired [9]. While high SHAPE reactivity matches well with unpaired nucleotides, medium reactivity could correspond either to paired or unpaired nucleotides, depending on various factors such as the RNA structure itself or the experimental conditions. Recent advances focus on the development of thermodynamics-based computational tools that incorporate such experimental data as soft constraints to handle this ambiguity [7, 10, 11]. While the use of probing data has significantly improved the accuracy of in silico structure prediction for several classes of RNAs [12], these methods have not solved the folding problem for large RNA systems, such as long non-coding RNAs (lncRNAs, typically 200-20k bases). The reason is that the footprinting data is one-dimensional, i.e. it does not identify base pairing partners of a given nucleotide. In particular, probing data alone cannot distinguish short-range and long-range base pairings.
For long RNAs, the existence of the latter, however, has been shown experimentally [13] as well as theoretically [14, 15]. Thus, even combined with experimental data, there are still numerous RNA folds consistent with the probing data. Based on chemical probing, Sanbonmatsu et al. [16] developed a fragmentation method for determining the secondary structure of lncRNAs in the wet lab. Their approach applies chemical probing of the entire RNA, followed by parallel probing of certain overlapping fragments. Regions of each fragment exhibiting similar probing profiles are folded independently, and combined in order to obtain the entire structure. Although the method has been successfully applied to identify the structures of several lncRNAs [16, 17], their choice of fragments is empirical, which hinders its application to longer RNA sequences [17].

In the following, as in [16], we shall stipulate (∗): in all probing experiments there exists a unique distinguished structure, the target. We furthermore assume that the collection of all possible structures is in thermodynamic equilibrium, i.e. a Boltzmann ensemble, and that the target is contained in the ensemble. Hence, the problem of structure prediction gives rise to the following challenge:

Problem 1 How to enhance target identification in a Boltzmann ensemble of structures?

In relation to [16], our approach to Problem 1 can be understood and outlined as follows: Sanbonmatsu et al. employ a parallel localization of the chemical probing experiment via fragmentation. The fragments are somewhat ad hoc and almost certainly "break" any long-range base pairs, see Fig. 1. The novelty of this paper lies in a different, sequential fragmentation process, assuming (∗). Our input consists of the probing data of the entire sequence, giving rise to an augmented Boltzmann ensemble containing by construction the target (which of course is not known). Instead of a parallel fragmentation into subsequences, we "localize" differently, namely we successively ask whether or not a specific base pair is contained in the target. The particular base pair is identified using information-theoretic properties of the Boltzmann ensemble. While we do not know the target explicitly, we can decide, with high accuracy, whether it contains a particular base pair. Specifically, we cut out the subsequence covered by the base pair, and glue the remainder at the cut-points. For the resulting two subsequences, our approach requires probing data to be generated from probing experiments. We then compare the additional probing data with the initial probing profile of the entire sequence. At a fundamental level, our fragmentation differs from Sanbonmatsu's approach [16], in that we allow bases from two non-contiguous fragments to pair, see Fig. 1. As a consequence, our method is well suited to deal with long-range base pairings, these being a prominent feature of RNA secondary structures [13, 14]. The answer to each question produces a split of the ensemble into two sub-samples, and we arrive at smaller sub-samples via successive querying and answering. We then establish that, after a few iterations, we arrive at a sample that contains a distinguished structure that, with high probability, coincides with the target. We illustrate the overall strategy in Fig. 2. We formalize the above sequential process by considering a variant of the Rényi-Ulam game, in which a player tries to identify an unknown object via asking yes-no questions [18, 19].
Our framework is centered around the ensemble tree, a hierarchical bi-partition of the input ensemble, whose leaves are comprised of sub-samples exhibiting a distinguished structure with high probability. Specifically, the ensemble tree is constructed by recursively querying whether or not a base pair of maximum information entropy is contained in the target. We prove that the query of a maximum entropy base pair splits the ensemble into two even parts and in addition provides maximum reduction in the entropy of the ensemble. These questions can be answered because of assumption (∗). They are answered by relating additional "local" probing data with the initial data, employing the modularity of RNA secondary structures. By this means, we identify the correct path in the ensemble tree from the root to the leaf.

Fig. 1 The fragmentation of Sanbonmatsu et al. [16] (LHS) and our approach (RHS). Their approach almost certainly "breaks" long-range base pairs (dotted arcs), while ours allows bases from two non-contiguous fragments to pair

The key result of this paper is that the probability of the ensemble tree correctly identifying the target in the leaf is greater than 90%, for Boltzmann ensembles from random sequences of length 300, see "Target identification" section. To demonstrate this result, we firstly utilize a q-Boltzmann sampler with signature distance filtration, which is well suited for Boltzmann ensembles subject to the probing data constraint [7, 11], see "The Boltzmann ensemble" section. Secondly, we consider the error rates arising from answering the queries via probing data. We show that these error rates can be significantly reduced via repeated queries, see "Target identification" section. Thirdly, in "Entropy" section, we prove that a leaf with low information entropy contains a distinguished structure. We show that, once in the correct leaf, the distinguished structure is almost always identical to the target. Fourthly, in "Robustness" section, we analyze the robustness of our approach. We demonstrate that the ensemble tree localizes the target with high fidelity, robustly across Boltzmann samples of different sizes and nucleotide compositions. We would point out that the q-Boltzmann sampler is only required to benchmark our approach on random sequences, due to the absence of probing data. In application scenarios where chemical probing data is provided, our approach utilizes Boltzmann ensembles with soft constraints [7]. As proof of concept, we apply our approach to natural RNAs with SHAPE probing data and compute the distinguished structure from the ensemble tree to predict the accepted secondary structure, i.e., the target. We show in "Performance comparison" section that our approach improves the average prediction accuracy by 5%, compared with [8]. To summarize, the key points of our approach are: 1. our method is based on a Boltzmann sample and derives a sub-sample that contains the target with high probability, 2. the derivation is facilitated by means of the ensemble tree, and the identification of the correct path from root to leaf is obtained via a variant of the Rényi-Ulam game, 3. the answers to the respective queries are inferred from chemical probing, by relating additional probing data to the initial data using modularity.
This paper is organized as follows: in "Methods" section, we introduce the main elements of our framework: the Rényi-Ulam game, the Boltzmann ensemble, base-pair queries and the ensemble tree. In "Path identification" section, we demonstrate how to integrate additional probing data with the initial data, allowing us to answer the queries and thereby identify the correct path. In "Results" section, we analyze the ensemble tree and show that our approach identifies the target reliably and efficiently. Finally, we discuss and integrate our results in "Discussion" and "Conclusion" sections.

Fig. 2 The workflow diagram (LHS) and schematic example (RHS) of our approach. LHS: we compute the ensemble tree, i.e., successively split the Boltzmann ensemble of structures into smaller sub-samples, by querying whether certain base pairs are contained in the target or not. RHS: the answers to the base-pair questions allow us to identify a path (red) in the ensemble tree from the root to the leaf, which contains a distinguished structure

We now approach Problem 1 via the Rényi-Ulam game, a two-person game played by a questioner (Q) and an oracle (O). Initially O thinks of an integer, Z, between one and one million, and Q's objective is to identify Z by asking yes-no questions. O is allowed to lie at a rate specific to yes and no, respectively. The Rényi-Ulam game has been extensively studied since the early works by Rényi and Ulam [18, 19], and has various applications such as adaptive error-correcting codes in the context of noisy communication [20, 21]. Depending on the respective application scenario, numerous variants of the Rényi-Ulam game have been considered, specifying the format of admissible queries or the way O lies [22, 23]. In what follows, we shall play the following version of the game: O holds a set of bit strings y_1 y_2 ... y_l of finite length l, not every bit string being equally likely selected, and the queries ask for the state of the i-th bit, i.e., Q executes bit queries. O's lies occur at random, are independent and context-dependent. Specifically, O lies with probability e_0 and e_1 in case of the truthful answer being "No" and "Yes", respectively. The particular cases e_0 = 0 and e_1 = 0 have been studied in the context of half-lies [24]. The majority of studies on the Rényi-Ulam game to date are combinatorial. That is, they stipulate the number of lies (or half-lies) to be a priori known and focus on finding optimal search strategies which use a minimum number of queries to identify the target in all cases [23, 24]. Within the framework of this paper, we study the manifestation of the oracle, which is embodied as an indicator random variable whose distribution is derived from a modularity analysis on RNA MFE-structures, see "Path identification" section. In this manifestation, erroneous responses arise intrinsically at random: either as a result of the distribution of the random variable (r.v.) or from intrinsic errors of the experimental data. By construction, this rules out a unique winning strategy for Q: instead, we consider the average fidelity or accuracy of identifying the target utilizing a sub-optimal number of queries. We shall propose an entropy-based strategy: at any point a query is selected, relative to the subset of bit strings coinciding with the target in all previously identified positions, that maximizes the uncertainty reduction on this subset.
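To make the entropy-based strategy concrete, the following is a minimal sketch of the query loop in this Rényi-Ulam variant, an illustration rather than the authors' implementation: weighted bit strings stand in for O's set, the bit of maximum entropy is queried, a noisy oracle answers with lie rates e_0 and e_1, and the candidate set is restricted to the strings consistent with the answer. Function names and the default error rates are illustrative assumptions.

```python
import math
import random

def bit_entropy(p):
    """Binary entropy in bits; 0 by convention when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def max_entropy_bit(strings, weights):
    """Index of the bit whose value is most uncertain under the weights."""
    total = sum(weights)
    best_i, best_h = 0, -1.0
    for i in range(len(strings[0])):
        p1 = sum(w for s, w in zip(strings, weights) if s[i] == '1') / total
        h = bit_entropy(p1)
        if h > best_h:
            best_i, best_h = i, h
    return best_i

def noisy_oracle(target, i, e0=0.05, e1=0.01):
    """Answer whether target[i] == '1', lying with rate e0 (truthful 'no') or e1 (truthful 'yes')."""
    truth = target[i] == '1'
    lie = random.random() < (e1 if truth else e0)
    return truth != lie

def identify(strings, weights, target, rounds=10):
    """Shrink the candidate set by repeatedly querying a maximum-entropy bit."""
    cand = list(range(len(strings)))
    for _ in range(rounds):
        if len(cand) <= 1:
            break
        sub = [strings[k] for k in cand]
        wts = [weights[k] for k in cand]
        i = max_entropy_bit(sub, wts)
        ans = noisy_oracle(target, i)
        # keep only the strings consistent with the (possibly erroneous) answer
        cand = [k for k in cand if (strings[k][i] == '1') == ans]
    return cand
```

Restricting to the consistent subset after each answer mirrors the path through the ensemble tree introduced below.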
At a given point in time, an RNA sequence, x, assumes a fixed secondary structure by establishing base pairings. Over time, however, x assumes a plethora of RNA secondary structures appearing at specific rates, see Appendix A for details and context on RNA. These exist in an equilibrium ensemble expressed by the partition function [3] of x. More formally, the structure ensemble Ω of x is a discrete probability space over the set of all secondary structures, equipped with the probability p(s) of x folding into s. We shall assume that the ensemble of structures is in thermodynamic equilibrium, the distribution of these structures being described by a Boltzmann distribution. The Boltzmann probability, p(s), of the structure s is a function of the free energy E(s) of the sequence x folding into s, computed via the Turner energy model [25, 26], see Appendix B for details. The Boltzmann probability p(s) is expressed as the Boltzmann factor exp(−E(s)/RT), normalized by the partition function, where R denotes the universal gas constant and T is the absolute temperature. The Boltzmann distribution facilitates the computation of the partition function Z for each substructure. The partition function algorithm [3] for secondary structures computes Z and, in particular, the base pairing probabilities based on the free energies of the structures within the structure ensemble Ω. Let p_{i,j} denote the probability of a base pairing between nucleotides i and j in the ensemble Ω. Clearly, p_{i,j} can be computed as the sum of the probabilities of all secondary structures that contain (i, j), that is, p_{i,j} = Σ_{s ∈ Ω} δ_{i,j}(s) p(s), where δ_{i,j}(s) denotes the occurrence of the base pair (i, j) in s.

The thermodynamics-based partition function has been extended to incorporate chemical probing data to generate a Boltzmann ensemble, Ω_probe. These approaches [7, 10, 11] transform structure probing data into a pseudo-energy term, ΔG(s), which reflects how well the structure agrees with the probing data. The Turner free energy is then evaluated by adding the pseudo-energy term to the loop-based energy, i.e., E_probe(s) = E(s) + ΔG(s). The corresponding equilibrium ensemble, Ω_probe, is distorted in favor of structures that are consistent with the probing data, see Appendix C. For sequences whose probing data are not available, we utilize the 0-1 signature of the target, which is suited for probing data, and quantify the discrepancy between the Boltzmann ensemble and the target via the signature distance d_sn. The 0-1 signature of a structure s is a 0-1 vector whose k-th entry is 0 when the k-th base is unpaired in s, and 1 otherwise. The signature distance d_sn(s, s′) between two structures s and s′ is the Hamming distance between their corresponding 0-1 signatures, see Appendix A. We show that the average distance of an unrestricted ensemble Ω to a random target is 0.21n, while the distance of an ensemble Ω_probe incorporating simulated probing data is reduced to 0.03n, see Appendix D. This motivates us to define a q-Boltzmann ensemble, Ω_q, which consists of structures having signature distance to the target s at most qn, i.e., Ω_q = {s′ | d_sn(s′, s) ≤ qn}. By construction, the computation of Ω_q requires the 0-1 signature of the target, and does not need experimental probing data. In particular, we show that the ensemble Ω_probe has an average normalized signature distance similar to a q-ensemble having q = 0.05. In this paper we discuss unrestricted and restricted Boltzmann ensembles, Ω, Ω_probe and Ω_q.
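As an illustration of these sample-level quantities, the sketch below estimates the base pairing probabilities p_{i,j} = f(i,j)/N and the signature distance d_sn from a multiset sample of structures, each structure represented as a frozenset of base pairs (i, j). The representation and function names are our assumptions, not part of the published implementation.

```python
from collections import Counter

def base_pair_probabilities(sample):
    """Estimate p_{i,j} = f(i,j)/N from a multiset sample of structures,
    each structure given as a frozenset of (i, j) base pairs."""
    counts = Counter()
    for s in sample:
        counts.update(s)
    N = len(sample)
    return {bp: f / N for bp, f in counts.items()}

def signature(structure, n):
    """0-1 signature: entry k is 1 if base k is paired, else 0 (1-based positions)."""
    paired = {k for bp in structure for k in bp}
    return [1 if k in paired else 0 for k in range(1, n + 1)]

def signature_distance(s1, s2, n):
    """Hamming distance between the 0-1 signatures of two structures."""
    return sum(a != b for a, b in zip(signature(s1, n), signature(s2, n)))
```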
We shall employ greyscale diagrams in order to visualize a sample of secondary structures by superimposing them in one diagram, visualizing the base pairing probabilities. A greyscale diagram displays each base pair (i, j) as an arc with greyscale 1 − p_{i,j}, where greyscale 0 represents black and 1 represents white, see Fig. 3. Instead of computing the entire ensemble Ω, we shall consider sub-samples Ω′ consisting of N secondary structures of x, counted with multiplicities, and refer to Ω′ as the sample. For sufficiently large N (typically of around size 1000, see [4]), Ω′ provides a good approximation of the Boltzmann ensemble Ω. A sample Ω′ is a multiset of cardinality N and for each structure s in Ω′, its multiplicity, f(s), counts the frequency with which s appears in Ω′. Thus in the context of Ω′, p(s) is given by the s-multiplicity divided by N, p(s) = f(s)/N. The base pairing probability p_{i,j} has its Ω′-analogue f(i, j)/N, where f(i, j) denotes the frequency of the base pair (i, j) in Ω′. We shall develop our framework in the context of the structure ensemble Ω, and only reference the sample Ω′ in case the results are particular to Ω′.

Any structure over n nucleotides is considered as a bit string of dimension (n choose 2), stipulating that (1) a structure is completely determined by the set of base pairs it contains and (2) any position can pair with any other position, except itself. The bit query now determines a single bit, i.e. whether or not the base pair (i, j) is present in the target, stipulating that a unique target is assumed by the sequence in question. We associate the query about the target with a random variable, X_{i,j}, defined on the ensemble, via questioning the presence of (i, j) in each structure. By construction, the distribution of X_{i,j} is given by the base pairing probability, P(X_{i,j}(s) = 1) = p_{i,j}. Any base pair, (i, j), has an entropy, defined by the information entropy of X_{i,j}, i.e. H(X_{i,j}) = −p_{i,j} log_2 p_{i,j} − (1 − p_{i,j}) log_2(1 − p_{i,j}), where the units of H are bits. The entropy H(X_{i,j}) measures the uncertainty of the base pair (i, j) in Ω. When a base pair (i, j) is certain to either exist or not, its entropy H(X_{i,j}) is 0. However, in case p_{i,j} is closer to 1/2, H(X_{i,j}) becomes larger. The r.v. X_{i,j} partitions the space Ω into two disjoint sub-spaces Ω_0 and Ω_1, where Ω_k = {s ∈ Ω : X_{i,j}(s) = k} (k = 0, 1), and the induced distributions are given by the probabilities p(s) restricted to Ω_0 and Ω_1, normalized by P(Ω_0) = 1 − p_{i,j} and P(Ω_1) = p_{i,j}, respectively. Intuitively, H(X_{i,j}) quantifies the average number of bits of information we would expect to gain about the ensemble when querying the base pair (i, j). This motivates us to consider maximum entropy base pairs, i.e. a base pair (i_0, j_0) having maximum entropy among all base pairs in Ω, (i_0, j_0) = argmax_{(i,j)} H(X_{i,j}). As we shall prove in "Entropy" section, X_{i_0,j_0} produces maximally balanced splits.

Equipped with the notion of ensemble and bit query (i.e. the respective maximum entropy base pairs), we proceed by describing our strategy to identify the target structure as specified in Problem 1. The first step consists in having a closer look at the space of ensemble reductions. Each split obtained by partitioning the ensemble using the r.v. X_{i,j} can in turn be bipartitioned itself via any of its maximum entropy base pairs. This recursive splitting induces the ensemble tree, T(Ω), whose vertices are sub-samples and whose k-th layer represents a partition of the original ensemble into 2^k blocks. T(Ω) is a rooted binary tree, in which each branch represents an X_{i,j}-induced split of the parent into its two children.
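The recursive splitting can be sketched as follows, a simplified stand-in for the construction formalized next, under the assumptions that structures are given as frozensets of base pairs and that a height bound of 10 corresponds to L = log_2 N + 1 = 11 levels. The function names are ours.

```python
import math
from collections import Counter

def entropy_bits(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def max_entropy_pair(sample):
    """Base pair (i, j) whose indicator has maximum entropy in the sample."""
    N = len(sample)
    counts = Counter(bp for s in sample for bp in s)
    if not counts:
        return None
    return max(counts, key=lambda bp: entropy_bits(counts[bp] / N))

def ensemble_tree(sample, level=0, max_level=10):
    """Recursively bipartition a multiset of structures (frozensets of base pairs)
    by the presence/absence of a maximum-entropy base pair."""
    if level >= max_level or len(set(sample)) <= 1:
        return {"leaf": sample}
    bp = max_entropy_pair(sample)
    if bp is None:
        return {"leaf": sample}
    with_bp = [s for s in sample if bp in s]
    without_bp = [s for s in sample if bp not in s]
    if not with_bp or not without_bp:  # the pair is certain in the sample; nothing to split
        return {"leaf": sample}
    return {"query": bp,
            "yes": ensemble_tree(with_bp, level + 1, max_level),
            "no": ensemble_tree(without_bp, level + 1, max_level)}
```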
Formally, the process halts if either the resulting sub-spaces are all homogeneous, i.e. their structural entropy is 0, which means that they contain only copies of one structure, or it reaches a predefined maximum level L. In our case we set the maximum level to be L = log_2 N + 1 = 11, that is, the height of the ensemble tree is at most 10. The procedure is described as follows:

1. start with the ensemble Ω.
2. for each space Ω_k with H(Ω_k) > 0, where k is a sequence of 0s and 1s having length at most L − 1 = 10:
• select the maximum entropy base pair X_{i_k,j_k} of Ω_k as the feature, i.e. (i_k, j_k) = argmax_{(i,j)} H(X_{i,j}) on Ω_k,
• split Ω_k into sub-spaces Ω_{k0} and Ω_{k1} using the feature.
3. repeat Step 2 until all new sub-spaces either have structural entropy 0 or reach the maximum level 11.

In Fig. 4 we display an ensemble tree. We would remark that the ensemble tree may not be complete. The reason is that, when a sub-sample has entropy 0 and thus consists of only one structure, the splitting of this sub-sample stops. Clearly, for each space Ω_k, the entropies of base pairs can be computed via the traversal of each bit in each structure, and the number of bit queries grows quadratically in the sequence length n. Thus finding the maximum entropy base pair can be implemented in quadratic time, O(|Ω_k| · n^2), with respect to the sequence length n. Since the sum of the sizes of the spaces on level i equals N, each level of the ensemble tree requires O(N · n^2) computations, and the entire tree is therefore computed in O(L · N · n^2) time.

Given the ensemble tree, we shall construct a path recursively, starting from the root, to identify the leaf that contains the target. We shall do so by successive bit queries about maximum entropy base pairs, see Fig. 4. As mentioned before, we employ two manifestations of the oracle, one using modularity based on RNA folding, and the other determining the existence of base pairs by other experimental means. Here we shall employ the modularity of RNA structures, i.e. the loops, which constitute the additive building blocks of the free energy, have only marginal dependencies. This can intuitively be understood by observing that any two loops can only intersect in at most two nucleotides, see Appendix B. Let us introduce the notion of embedding and extraction of a contiguous subsequence or fragment, which are functions ε_{i,j}: B^{n−m} × B^m → B^n and ξ_{i,j}: B^n → B^{n−m} × B^m given by ξ_{i,j}(x_1 ... x_n) = (x_i ... x_j, x_1 ... x_{i−1} x_{j+1} ... x_n) and its inverse ε_{i,j}, where B denotes the set of bases {A, U, C, G}, and j − i + 1 = n − m. By construction, we have ε_{i,j} ∘ ξ_{i,j} = id, and a contiguous subsequence or fragment of an RNA sequence is called modular if, being extracted, it folds into the same arc configuration as it does embedded in the sequence. Next we show how to employ probing data to reliably answer whether or not a particular (maximum entropy) arc is contained in the target structure. Structural modularity implies that if this arc can indeed be found in the target structure, then a comparative analysis of the probing data of the entire sequence with those of the extracted sequence, as well as the remainder concatenated at the cut points, will exhibit distinctive similarity. Modularity is a decisive discriminant if, in contrast, random fragments do not exhibit such similarity. To quantify to what extent modularity can discriminate base pairs, we perform computational experiments on random sequences via splittings. For each sequence, we consider its MFE structure s computed via ViennaRNA [12]. We shall utilize the 0-1 signature of the MFE to mimic its probing data.
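A minimal sketch of the extraction and embedding maps on plain sequences is given below; the same slicing applies verbatim to reactivity vectors stored as lists. Indices are 1-based as in the text, and the function names are ours.

```python
def extract(x, i, j):
    """xi_{i,j}: split x into the fragment x[i..j] and the remainder glued at the cut points."""
    fragment = x[i - 1:j]
    remainder = x[:i - 1] + x[j:]
    return fragment, remainder

def embed(fragment, remainder, i):
    """epsilon_{i,j}: re-insert the fragment at position i; inverse of extract."""
    return remainder[:i - 1] + fragment + remainder[i - 1:]

# round trip: epsilon composed with xi is the identity
seq = "GGGAAACCCUUUGGGAAACCC"
frag, rest = extract(seq, 4, 9)
assert embed(frag, rest, 4) == seq
```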
Given two positions i and j, we cut the entire sequence x into two fragments, x i,j and the remainder x i,j , i.e., ξ i,j (x) = (x i,j ,x i,j ) . Subsequently, the two fragments x i,j and x i,j refold into their MFE structures s i,j and s i,j , respectively, which are combined into a structure ǫ i,j (s i,j ,s i,j ) . If bases i and j are paired in s, such a splitting is referred to as modular and the resulting structure is denoted by s ′ . Otherwise, it is called random, with the output structure s ′′ . We proceed by computing the base-pair and signature distance from the MFE s to the structures s ′ or s ′′ . The base-pair distance is one of the most frequently used metrics to quantify the similarity of two different structures viewed as bit strings [27, 28] , the signature distance measures the similarity between their signatures, which is well suited within the context of the probing profiles, see Appendices A and D. In the above computation, we run through all possible positions i and j. For fragments x i,j and x i,j , we compute the corresponding MFEs and distances. Accordingly, the time complexity of the computation is O(n 6 ) , given by O(n 2 ) choices of indices times the O(n 3 ) complexity of the MFE folding and the linear time of the distance computation, where n is the sequence length. Figure 5 (LHS) compares the distribution of the signature distances d sn (s, s ′ ) and d sn (s, s ′′ ) obtained from modular and random splittings, respectively. The structures induced by modular splitting have much more similar signatures to their MFE structures, than those induced by random splitting. The situation is analogous for base-pair distances, see Fig. 5 (RHS). Since these distances measure structural similarity, the data also indicates that, when i and j form a base pair in s, the fragment x i,j is more likely to fold into the same configuration as it does being embedded, i.e. x i,j is modular. The data displayed in Fig. 5 suggests the threshold distance, θ , for signatures, by which we distinguish modular from random. More specifically, if the signature distance is smaller than θ , we predict that bases i and j are paired. Otherwise they are unpaired. In order to quantify the accuracy of this classification, we consider the resulting false discovery rate (FDR) and false omission rate (FOR). 1 In our Rényi-Ulam game variation, the expected values of FDR and FOR are the error rates e 1 and e 0 in case the truthful answer being yes and no, respectively. Equipped with the ensemble tree and the oracle via modularity, our framework provides a fragmentation process combining "local" probing profiles with the "global" one via modularity. The novel fragmentation process is guided by the base-pair queries of the ensemble tree inferred from the restricted Boltzmann sample incorporating chemical probing. Given the maximum entropy base pair, (i, j), extraction splits the sequence into two fragments, one being the extracted fragment x i,j and the other, x i,j . We perform probing experiments on these two segments, and obtain the reactive probabilities q i,j and q i,j , respectively. Let q be the reactive probability for the entire sequence, and q ′ be the embedding of q i,j into q i,j , i.e., q ′ = ǫ i,j (q i,j ,q i,j ) . As shown in the previous subsection, if the Hamming distance d(q, q ′ ) is smaller than threshold θ , then the probing profiles are similar, i.e., two bases i and j are paired. Otherwise, they are unpaired in the target structure. 
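The decision rule and its error rates can be sketched as follows, assuming one has collected signature distances from modular and from random splittings as in Fig. 5; following the text, the FDR plays the role of e_1 and the FOR the role of e_0. Function names are illustrative.

```python
def decide_paired(d, theta):
    """Predict that bases i and j are paired when the signature (Hamming)
    distance between the global and the re-assembled local profiles is below theta."""
    return d < theta

def error_rates(modular_distances, random_distances, theta):
    """Empirical FDR (e1) and FOR (e0) for a given threshold theta.
    modular_distances: distances observed when (i, j) is truly paired;
    random_distances:  distances observed when it is not."""
    tp = sum(d < theta for d in modular_distances)
    fp = sum(d < theta for d in random_distances)
    fn = len(modular_distances) - tp
    tn = len(random_distances) - fp
    fdr = fp / (tp + fp) if tp + fp else 0.0   # e1: erroneous "paired" calls
    for_ = fn / (fn + tn) if fn + tn else 0.0  # e0: erroneous "unpaired" calls
    return fdr, for_
```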
The fragmentation procedure can be summarized as follows:

1. a probing experiment for the entire sequence is performed and the reactive probability q is obtained,
2. a Boltzmann sample Ω_probe of N structures, consistent with the probing data q, is computed,
3. the ensemble tree T(Ω) containing the sub-spaces Ω_k and the corresponding maximum entropy base pairs X_{i_k,j_k} is constructed,
4. starting with Ω, we recursively answer the queries, determining thereby a path through the ensemble tree from the root to a leaf,
5. once in a leaf, Proposition 1 guarantees the existence of a distinctive structure, which we stipulate to be the target structure.

Figure 7 demonstrates the workflow of the fragmentation process, which can be considered an implementation of our overall strategy in Fig. 2 (LHS), via incorporating the new fragmentation process into the workflow. We would point out that the key to path identification is the determination of base pairs. Instead of using modularity and "local" probing data, we can also apply other experimental approaches to identifying base pairs. In Appendix E, we summarize state-of-the-art experimental approaches that could possibly be utilized to determine base pairs and to identify the path in the ensemble tree. In particular, we detail two methods, both of which utilize chemical probing data in different ways than our fragmentation [29, 30] and recover base pairs with a false discovery rate less than 0.05.

Fig. 6 The error rates e_0 and e_1 as a function of the threshold θ. We use the same sequences and structures as described in Fig. 5. When the signature distance is smaller than θ, we predict that bases i and j are paired, otherwise they are unpaired. The error rates e_0 and e_1 are given by the false omission rate (FOR) and false discovery rate (FDR), respectively

Given an input sample Ω, we construct the ensemble tree T(Ω) having maximum level L = 11, recursively computing the maximum entropy base pairs as described in Algorithm 1. In this section, we shall analyze the entropy of leaves in order to quantify the existence of a distinguished structure and to identify the target. To quantify the uncertainty of an ensemble, we define the structural entropy of an ensemble, Ω, of an RNA sequence, x, as the Shannon entropy H(Ω) = −Σ_s p(s) log_2 p(s), the units of H being bits. The sum is taken over all secondary structures s of x, and p(s) denotes the Boltzmann probability of the structure s in the ensemble Ω. The notion of structural entropy originated in thermodynamics and is usually regarded as a measure of disorder, or randomness, of an ensemble [31, 32]. Given a sample Ω′ of size N, the structural entropy has the upper bound log_2 N, that is, H(Ω′) reaches its maximum when all sampled structures are different. Throughout the paper, we assume N = 1024 and therefore H(Ω′) ≤ 10.
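A direct way to compute the structural entropy of a sample (structures as hashable objects, e.g. frozensets of base pairs) is sketched below; the helper name is ours.

```python
import math
from collections import Counter

def structural_entropy(sample):
    """Shannon entropy (in bits) of a multiset sample of structures:
    H = -sum_s p(s) log2 p(s), with p(s) = f(s)/N."""
    N = len(sample)
    return -sum((f / N) * math.log2(f / N) for f in Counter(sample).values())

# For a sample of N = 1024 structures the entropy is at most log2(1024) = 10 bits,
# attained when all sampled structures are distinct.
```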
Proposition 1 implies that a sample with small structural entropy contains a distinguished structure; a proof is given in Appendix F. We refer to a sample having a distinguished structure of probability at least p as being p-distinguished. Next we quantify the reduction achieved by a bit query on an ensemble. Recall that the associated r.v. X_{i,j} of a base pair (i, j) partitions the sample into two disjoint sub-samples Ω_0 and Ω_1, where Ω_k = {s ∈ Ω : X_{i,j}(s) = k} (k = 0, 1). The conditional entropy, H(Ω|X_{i,j}), represents the expected value of the entropies of the conditional distributions on Ω, averaged over the conditioning r.v. X_{i,j}, and can be computed by H(Ω|X_{i,j}) = (1 − p_{i,j}) H(Ω_0) + p_{i,j} H(Ω_1). Then the entropy reduction R(Ω, X_{i,j}) of X_{i,j} on Ω is the difference between the a priori Shannon entropy H(Ω) and the conditional entropy H(Ω|X_{i,j}), i.e. R(Ω, X_{i,j}) = H(Ω) − H(Ω|X_{i,j}). The entropy reduction quantifies the average change in information entropy from an ensemble in which we cannot tell whether or not a certain structure contains (i, j), to its bipartition where one of the two blocks consists of structures that contain (i, j) and the other is its complement.

Proposition 2 The entropy reduction R(Ω, X_{i,j}) of X_{i,j} is given by the entropy H(X_{i,j}) of X_{i,j}, i.e. R(Ω, X_{i,j}) = H(X_{i,j}).

Proposition 2 shows that querying a Bernoulli random variable inducing a split reduces the average conditional entropy exactly by the entropy of the random variable itself. In the context of the Rényi-Ulam game, Q asks a question that helps to maximally reduce the space of possibilities. A proof of Proposition 2 is presented in Appendix G. The next observation shows that querying maximum entropy base pairs induces a best possible balanced split of the ensemble.

Proposition 3 Let (i_0, j_0) be a maximum entropy base pair of Ω, and let Ω_0^{i,j} and Ω_1^{i,j} denote the two blocks of the split induced by X_{i,j}. Then (1) |P(Ω_0^{i_0,j_0}) − P(Ω_1^{i_0,j_0})| ≤ |P(Ω_0^{i,j}) − P(Ω_1^{i,j})| and (2) H(Ω|X_{i_0,j_0}) ≤ H(Ω|X_{i,j}), for any (i, j).

Proposition 3 first shows that the bit query about the maximum entropy base pair X_{i_0,j_0} partitions the ensemble as balanced as possible, i.e. into sub-samples having the minimum difference of their probabilities. It furthermore establishes that the splits have minimum average structural entropy (or uncertainty), since X_{i_0,j_0} provides the maximum entropy reduction on the ensemble. Thus the query about (i_0, j_0) is the most informative among all bit queries. Finally we quantify the average entropy of sub-samples, Ω_t, on the t-th level of the ensemble tree, and establish the existence of a distinguished structure. The analysis of entropies depends, of course, on the way the samples are constructed. To this end, given a random sequence, we construct the ensemble tree for two types of samples, one being unrestricted samples of structures, Ω, and the other utilizing q-Boltzmann sampling that incorporates the signature of the target, Ω_q, see "The Boltzmann ensemble" section. Specifically, the target structure is randomly selected from the unrestricted sample, and the q-Boltzmann sample utilizes the 0-1 signature of the target. We would point out that our framework does not require us to "choose" the target for a sequence; here we make the choices to facilitate the computation on random sequences. For unrestricted Boltzmann samples, the structural entropy H(Ω_t) of sub-samples on the t-th level decreases as the level t increases, see Fig. 9. In particular, the average entropy H(Ω_11) of leaf samples is 0.328 and 0.147, for sequences having 200 and 300 nucleotides, respectively. Proposition 1 guarantees that the leaf Ω_11 is 0.90-distinguished, i.e. it contains a distinguished structure of probability at least 0.90. For q-Boltzmann samples Ω_q of structures having signature distance to the target s at most qn, the small entropy of the leaf and the high ratio of the distinguished structure are robust over a range of q-values, see Fig. 10. We also observe that, for longer sequences, the entropy is smaller, and therefore the ratio of the distinguished structure is higher. Any leaf of the ensemble tree exhibiting a structural entropy less than one contains, by Proposition 1, a distinguished structure. Successive queries produce a unique, distinguished leaf, Ω*, which, with high probability, contains structures that are compatible with the queries. Let s* be the distinguished structure in Ω*, and let s denote the target.
In this section, we shall analyze this probability, P(s ∈ Ω*), as well as P(s* = s) and P(s* = s | s ∈ Ω*), see Table 1. For the path identification to the leaf Ω*, we consider the error rates e_0 = 0.05 and e_1 = 0.01 computed in "Path identification" section. As detailed in "Path identification" section, these probabilities depend on the error rates e_0 and e_1, and since these errors occur independently, we derive P(s ∈ Ω*) = (1 − e_0)^{l_0} (1 − e_1)^{l_1}, where l_0 and l_1 denote the number of No-/Yes-answers to queried base pairs along the path, respectively.

Table 1 P(s ∈ Ω*): the probability of the target being in the leaf. P(s* = s): the probability of the distinguished structure being identical to the target. P(s* = s | s ∈ Ω*): the probability of correctly identifying the target, given that it is in the leaf

Figure 11 displays the distribution of l_1. We observe that l_1 has a mean around 5, i.e., the probabilities of queried base pairs being confirmed and being rejected are roughly equal. For l_0 = l_1 = 5, we have the theoretical estimate P(s ∈ Ω*) ≈ 0.736. In Fig. 12 we show that P(s ∈ Ω*) decreases as the error rate e_0 increases, for fixed e_1 = 0.01. For (unrestricted) Boltzmann samples generated from random sequences, the probability P(s ∈ Ω*) of the leaf containing the target is greater than 74%, which agrees with the above theoretical estimate. Note that this amounts to having no probing data as a constraint for the sampled structures, a worst case scenario, so to speak. Furthermore, the probability that the distinguished structure is identical to the target is approximately unchanged, see Table 2. P(s* = s | s ∈ Ω*) indicates that, once we are in the correct leaf, the chance of correctly identifying the target increases to 94% for sequences of length 300. Accordingly, the key factor is the correct identification of the leaf Ω*.

Fig. 11 The distributions of l_1, the number of queried base pairs on the path that are confirmed by the target structure. We generate unrestricted Boltzmann samples for random sequences of different lengths

For q-Boltzmann samples Ω_q filtered by signature distance ≤ qn we observe the following: the probability P(s ∈ Ω*) of the leaf containing the target is greater than 70%, robustly over a range of q-values, see Fig. 13. As expected, as q increases, the probability of the target being in the correct leaf decreases, due to the fact that the q-samples become less constrained by the probing data. In particular, we observe that, for q = 0.05 and sequences of length 300, the probability of the ensemble tree identifying the correct leaf is greater than 90%, see Fig. 13 (red dashed line). As the Boltzmann ensemble's incorporation of probing data via pseudo-energies results in a q-value of 0.05, this translates into P(s ∈ Ω*) ≥ 90% for ensembles generated by such restricted Boltzmann samplers for sequences of length 300. We demonstrate that the ensemble tree localizes the target with high fidelity robustly, across samples of sequences having various lengths and different signature filtrations q. Figure 14 (LHS) shows that the ensemble tree for longer sequences has a higher chance of identifying the target. Once we are in the correct leaf, the chance of correctly distinguishing the target significantly increases, from around 75% to over 94% in the case of sequences having 200 nucleotides, see Fig. 14 (RHS).
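For concreteness, the theoretical estimate quoted above can be reproduced directly from the stated error rates; this is a worked check, not additional data.

```python
e0, e1 = 0.05, 0.01   # error rates for No-/Yes-answers
l0, l1 = 5, 5         # typical number of No-/Yes-answers along a path
p_leaf = (1 - e0) ** l0 * (1 - e1) ** l1
print(round(p_leaf, 3))   # 0.736
```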
As mentioned above, the key is the correct identification of the leaf containing the target, and that its distinguished structure coincides with the latter. These events are quantified via P(s ∈ Ω*) and P(s* = s), which depend on the error rates e_0 and e_1. These error rates can be reduced by asking the same query repeatedly. In our Rényi-Ulam game, repeating the same query is tantamount to performing the same experiment multiple times. It is reasonable to assume that experiments are performed independently and thus errors occur randomly. Intuitively, repeated experiments reduce errors originating from the noisy nature of experimental data. Utilizing Bayesian analysis, we show that, if we get the same answer to the query twice, the error rates become significantly smaller, for example, e^[2]_0 = 0.003 and e^[2]_1 = 0.00005, see Appendix H. In principle, we can reduce the error rates by repeating the same query k times. The error rates approach 0 as k grows to infinity. In this case, P(s ∈ Ω*) ≈ 1, i.e. the leaf always contains the target. The fidelity of the distinguished structure P(s* = s) increases from 70 to 94% for sequences of length 300.

Fig. 13 The probability P(s ∈ Ω*) of being in the correct leaf for different q-values. We use the same sequences and q-Boltzmann samples as described in Fig. 10. Error bars show the standard deviations for random sequence samplings in the corresponding cases. The red dashed line denotes q-samples having q = 0.05, which is tantamount to Boltzmann samples Ω_probe incorporating the probing data via pseudo-energies

Fig. 14 The probabilities P(s* = s) (LHS) and P(s* = s | s ∈ Ω*) (RHS) of correctly identifying the target, either in general or conditioned on being in the correct leaf. We use the same sequences and q-Boltzmann samples as described in Fig. 10. Error bars show the standard deviations for random sequence samplings in the corresponding cases. The red dashed line denotes q-samples having q = 0.05, which is tantamount to Boltzmann samples Ω_probe incorporating the probing data via pseudo-energies

Robustness

A significant advantage of our approach is the robustness of target identification across Boltzmann samples of various sizes and different GC-contents. For sizes, we deliberately change the number N of sampled structures, ranging from 2^9 to 2^11. Accordingly, the maximum level L of the ensemble tree varies, i.e., it grows at a logarithmic scale, L = log_2 N + 1. For GC-contents, we utilize sequences with different nucleotide compositions to generate Boltzmann samples. It is believed that medium or low GC-content offers greater transcription efficiency, while high GC-content provides better structural stability [33]. We thus consider GC-rich and AU-rich random sequences. For each variant sample, we compute the distinguished structure s* in the leaf Ω* via L successive queries, and present the probabilities of the distinguished structure being identical to the target, see Tables 3 and 4. In Table 3, we demonstrate that the ensemble tree localizes the target with high fidelity robustly, across unrestricted samples of various sizes. Additionally, Table 3 shows that the ensemble tree for longer sequences has a higher chance of identifying the target, see the probabilities displayed in italics. We also observe that the probability of being in the distinguished leaf, P(s ∈ Ω*), slightly decreases as the sample size increases.
This can be improved by repeating the same query multiple times, as shown in "Target identification" section. Table 4 shows the robustness of target identification across samples of different GC-contents. This indicates that the effectiveness of our approach remains unaffected by RNA sequences with various GC-contents.

Table 3 We generate 1000 random sequences of length n. For each sequence, we then generate (unrestricted) Boltzmann samples of N structures together with a target structure s. The size N of the samples varies from 2^9 to 2^11, and the maximum level of the ensemble tree is given by L = log_2 N + 1. We compute the probabilities of identifying the target utilizing the ensemble tree. We display mean and standard deviation

Table 4 We generate 1000 random sequences of length n with different GC-contents, where GC-rich sequences consist of 30% Gs, 30% Cs, 20% As and 20% Us; Uniform sequences comprise 25% Gs, 25% Cs, 25% As and 25% Us; AU-rich sequences contain 20% Gs, 20% Cs, 30% As and 30% Us. This process can be done by software such as GenRGenS [34]. For each sequence, we then generate (unrestricted) Boltzmann samples of N = 2^10 structures together with a target structure s. We compute the probabilities of identifying the target utilizing the ensemble tree. We display mean and standard deviation

Here we apply our approach to natural RNAs, and compare the performance with the RNA structure modeling method developed by Hajdin et al. [8]. First, we use the data set of 18 RNAs with published SHAPE profiles and accepted secondary structures [8]. This data set includes tRNAs, ribosomal RNAs, riboswitches, and viruses. RNA lengths vary from 34 to 530 nucleotides, see Table 5. We consider the accepted secondary structure excluding pseudoknots as the target. Specifically, we reduce a pseudoknot by removing the helix having the minimum size in the pseudoknot. Then, for each sequence, we incorporate chemical probing data as pseudo-energies [7] and generate a Boltzmann sample Ω_probe of 2^10 structures, see "The Boltzmann ensemble" section. We compute the ensemble tree for each sample Ω_probe, and identify the distinguished leaf Ω* via successive base-pair queries on the target. For each base-pair query, we consider the error rates e_0 = 0.05 and e_1 = 0.01 computed in "Path identification" section. We output the distinguished structure s* in the leaf as the predicted structure. Here we drop the assumption that the target is always in the sample. We consider the model-agnostic property of our approach, i.e., whether it is guaranteed to find the "best" structure in the sample even if the correct target is not contained. Also, we point out that our framework does not require a priori knowledge of the target structure. However, because we lack probing data on fragments to confirm or reject certain base pairs using modularity, we need to utilize knowledge of the target to answer the queries. Nevertheless, as a proof of concept, we present the computational results of our approach applied to the data set. To compare with [8], we compute three measures of performance: sensitivity, positive predictive value (PPV), and accuracy. We show that the accuracy of our method is, on average, 5 percentage points higher than that of [8]. Moreover, the improvement in target identification accuracy is robust across sequences of different types and various lengths.
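The three performance measures referred to here can be computed from the predicted and accepted base-pair sets as sketched below, with the standard definitions Sens = TP/(TP+FN), PPV = TP/(TP+FP) and Acc their harmonic mean; the function name is ours.

```python
def prediction_scores(predicted_pairs, accepted_pairs):
    """Sensitivity, PPV and their harmonic mean (accuracy), computed from
    sets of predicted and accepted base pairs."""
    tp = len(predicted_pairs & accepted_pairs)
    fp = len(predicted_pairs - accepted_pairs)
    fn = len(accepted_pairs - predicted_pairs)
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    acc = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
    return sens, ppv, acc
```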
Although the sample Ω_probe does not contain the target structure for sequences longer than 76 nucleotides, the results demonstrate that our approach is capable of identifying the "best" structure, which is defined as the one in the sample having the smallest base-pair distance to the target. In our framework, the key factor is the correct identification of the leaf that contains the target. Figure 15 displays the average base-pair distances d_bp(s, Ω_t) between the target structure s and the t-th sub-sample Ω_t on the path. We contrast three scenarios, the expectation being taken over all ensemble trees (blue), over the set of ensemble trees in which the leaf containing the target is identified (green), and over its complement (orange). We show that the correct identification of the leaf containing the target significantly reduces the distance between the target and the sub-samples.

Table 5 We consider the accepted secondary structure excluding pseudoknots as the target. Our method identifies the distinguished structure from a Boltzmann sample of 2^10 structures via 10 base-pair queries on the target. For each base-pair query, we consider the error rates of accepting or rejecting a base pair, e_0 = 0.05 and e_1 = 0.01. To compare the structure prediction approach in [8] and our method, we present three measures of performance: sensitivity (Sens), the fraction of pairs in the accepted structure that are predicted (Sens = TP/(TP + FN)); positive predictive value (PPV), the proportion of predicted pairs that are in the accepted structure (PPV = TP/(TP + FP)); and accuracy (Acc), the harmonic mean of Sens and PPV (Acc = 2·Sens·PPV/(Sens + PPV))

Fig. 15 The average base-pair distance d_bp(s, Ω_t) between the target s and the sub-sample Ω_t on the path. The expectation is taken over all ensemble trees (blue), the set of ensemble trees in which the leaf containing the target is identified (green) and its complement (orange). The computation is based on the Boltzmann samples of sequences of length 300

Our framework is based on two assumptions. The first is sampling from the Boltzmann ensemble of structures. This assumption is important, as for an arbitrary sample, the splittings could be highly unbalanced and the leaf of the ensemble tree does not always contain a distinguished structure. By quantifying the distinguished structure via the flow of entropies of sub-samples on the path, we contrast three classes of samples, the first being a Boltzmann sample (B-sample), the second a uniform sample (U-sample) and the third an E-sample, see Fig. 16. We show that, in a Boltzmann sample, the entropies of sub-samples on the t-th level decrease much more sharply than those in the latter two classes, see Fig. 16 (LHS). In particular, the latter two produce leaves exhibiting an average entropy greater than 1, i.e. not containing a distinguished structure. As proved in Proposition 2, the entropy reduction equals the entropy of the queried base pair. Figure 16 (RHS) explains the reason for the significant reduction, namely that the maximum entropy base pairs in Boltzmann samples have entropy close to 1 on each level, implying that the bit queries split the ensemble roughly in half each time. The latter two types of samples do not exhibit this phenomenon. In upcoming work, we shall investigate this phenomenon by quantifying how the uncertainty or entropy of the ensemble is distributed over the bit queries. The second assumption is that the target is contained in the sample.
This assumption can be validated by generating samples of larger size, and checking whether or not the distinguished structure is reproducible. We would remark that, even though the probability of identifying the target P(s * = s) slightly decreases for larger samples (Table 3) , we can significantly improve P(s * = s) by performing the same experiment multiple times, see "Target identification" and Appendix H. Accordingly, the probability and entropy of a base pair is calculated in the context of the entire ensemble, and thus the ensemble tree together with maximum entropy base pairs. [32] show that the structural entropy of the entire Boltzmann ensemble is asymptotically linear in n, i.e. H(� entire ) ≈ 0.07n . Since each queried base pair reduces the entropy by approximately 1 and the reduction is additive by construction, the ensemble tree would require approximately 0.07n queries to identify a leaf that has entropy smaller than 1 and contains a distinguished structure. For a sample of RNA pseudoknotted structures, the ensemble tree in our framework can still be computed. However, the structure modularity no longer holds in the pseudoknot case. The reason is that a pseudoknot loop could intersect in more than one base pair with other loops, see Fig. 17 (RHS). The fragmentation with respect to a base pair involved in a pseudoknot could affect several loops, each contributing to the free energy. The change of loop-based energy could lead to splits folding into a different configuration compared to the full transcript. Nevertheless, it would be interesting to find out other experimental methods to facilitate our framework for RNA pseudoknotted structures. In this paper we propose to enhance the method of identifying the target structure based on RNA probing data. To facilitate this we introduce the framework of ensemble trees in which a sample derived from the partition function of structures is recursively split via queries using information theory. Each query is answered based Fig. 16 The average entropy of sub-samples H(� t ) (LHS) and queried base pairs H(X t ) (RHS) on the t-th level of the ensemble tree. We contrast the ensemble trees obtained from a Boltzmann sample (B, blue), a uniform sample (U, orange), or an E-sample (E, green), which is comprised of 2 10 distinct structures, each containing only one base pair. For the former two types of samples, we randomly generate 1000 sequences of length 200. For each sequence, we sample 2 10 structures together with a target structure s, according to the Boltzmann or uniform distributions on either RNA folding data in combination with chemical probing, employing modularity of RNA structures, or, alternatively, directly using experimental methods [29, 30] . The former type of inference can be viewed as a kind of localization of probing data, relating local to global data by means of structural modularity. We show that within this framework it is possible to identify the target with high fidelity and that this identification requires a small number of base pairs to be queried. In particular we present that, for the Boltzmann ensembles incorporating probing data via pseudo-energies, the probability of the ensemble tree identifying the correct leaf that contains the target is greater than 90% , see "Target identification" section. not a specific position is unpaired, or paired; instead, the method produces a probability. The probing data thus produce a vector of probabilities. 
Several competing methods have been developed to convert the footprinting data for each nucleotide into a probability. Due to its ambiguity, probing data has been further incorporated into RNA folding algorithms by adding a pseudo-energy term, �G(s) , to the free energy [7, 10, 11] , i.e. This term engages in the folding process as follows: while positions where structure prediction and experiment data agree with each other are rewarded by a negative pseudo-energy, mismatching locations receive a penalty by way of a positive term. This is tantamount to shifting the partition function in such a way that the equilibrium distribution of structures in probe favors those that agree with the data. Here we incorporate the signature of a target via restricted Boltzmann sampling structures with the signature distance filtration. We first analyze the signature distances in two classes of Boltzmann samples, one being unrestricted, , and the other being restricted probe that incorporates simulated E probe (s) = E(s) + �G(s). probing data via pseudo-energies. The target structure is randomly selected from the unrestricted sample, and probing data is simulated from the signature of the target by a binary model. That is, the reactivity is set to 0.1 when a nucleotide is unpaired in the target, and to 0.7 when it is paired. These values are computed from the mean of collected SHAPE data among both paired and unpaired nucleotides in E. coli sequences [8] . According to [7] , if SHAPE reactivity is 0.36, which lies in the middle of 0.1 and 0.7, there is no added pseudo-energy, i.e., G = 0. For both types of samples, the distribution of the signature distance between the target s and the ensemble is approximately normal, Fig. 18 . The means and variances of the normalized signature distance are shown in Table 6 . It shows that, while the average signature distance between the target and the unrestricted sampled structure is around 0.21n, integrating the signature of the target reduces the distance to 0.03n. This indicates that the incorporation of the signature improves the accuracy of the Boltzmann sampler identifying the target. The above analysis motivates us to introduce a q-Boltzmann sampler for structures with signature distance filtration. For any fraction q ∈ (0, 1) , let q denote the restricted Boltzmann ensemble of structures having signature distance to the target at most q · n , i.e., � q = {s ′ |d sn (s ′ , s) ≤ q · n} . The enhanced Boltzmann Fig. 17 The loop-based decomposition of a secondary structure (LHS) and a pseudoknot (RHS). LHS: two adjacent loops intersect at one base pair. RHS: two pseudoknot loops meet at two base pairs (orange) Fig. 18 The distributions of the signature distances d sn (s, �) and d sn (s, � probe ) between the target s and two types of Boltzmann ensembles. LHS utilizes unrestricted samples of structures, and RHS uses samples probe incorporating probing data simulated from the signature of the target structure s Li and Reidys Algorithms Mol Biol (2020) 15:15 sampling can be implemented by partition function [3] and stochastic backtracking technique [4] , with the augmentation via an additional index recording the signature distance. A complete description of the new sampler will be provided in a future publication. The constraint on the signature distance changes the equilibrium distribution of structures via eliminating those that are inconsistent with signature over certain ratio q. Table 6 shows the means and variances of the normalized signature distance for q . 
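The authors implement Ω_q via an augmented partition function and stochastic backtracking; purely for illustration, the same set can be obtained from an existing sample by rejection, i.e. by discarding sampled structures whose signature distance to the target exceeds q·n. The sketch below assumes structures represented as frozensets of base pairs; it is not the q-Boltzmann sampler itself, and the function names are ours.

```python
def signature(structure, n):
    """0-1 signature of a structure given as a frozenset of (i, j) base pairs."""
    paired = {k for bp in structure for k in bp}
    return [1 if k in paired else 0 for k in range(1, n + 1)]

def q_filter(sample, target, n, q=0.05):
    """Keep sampled structures whose 0-1 signature is within q*n of the target's
    (a rejection-style stand-in for the q-Boltzmann sampler described above)."""
    t_sig = signature(target, n)
    cutoff = q * n
    return [s for s in sample
            if sum(a != b for a, b in zip(signature(s, n), t_sig)) <= cutoff]
```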
In particular, we observe that Boltzmann samples probe incorporating the probing data via pseudo-energies behave similarly as q-samples having q = 0.05. Determination of base pairs is a fundamental and longstanding problem in RNA biology. A large variety of experimental approaches have been developed to provide reliable solutions to the problem, such as X-ray crystallography, nuclear magnetic resonance (NMR), cryogenic electron microscopy (cryo-EM), chemical and enzymatic probing, cross-linking [39] [40] [41] [42] . Each method has certain strengths and limitations. In particular, chemical probing, as one of the most widely accepted experiments, allows to detect RNA duplexes in vitro and in vivo, and has been combined with high-throughput sequencing to facilitate large-scale analysis on lncRNAs [42] . Thus, in the following, we focus on determining the queried base pairs via chemical probing. Chemical probing data is one-dimensional, i.e. it does not specify base pairing partners. Thus probing data itself does not directly detect base pairings, and any structure information can only be inferred based on compatibility with probing data. Two strategies of structural inference have been developed, correlation analysis and mutate-and-map. [29] introduce PAIR-MaP, which utilizes mutational profiling as a sequencing approach and correlation analysis on profiles. The authors claim that PAIR-MaP provides around 0.90 accuracy of structure modeling (on average, sensitivity 0.96 and false discovery rate 0.03). [30] introduce M2-seq, a mutate-and-map approach combined with next generation sequencing, which recovers duplexes with a low false discovery rate ( < 0.05). Proposition 4 Let ′ be a sample of size N and s ∈ ′ be a structure having probability p 0 . Then the structural entropy of ′ is bounded by where Proof By construction, the multiplicity of s in ′ is given by p N 0 = ⌊p 0 N ⌋ . Since the function −x log 2 x is for x > 0 concave, the structural entropy is maximal in case of all remaining N − p N 0 structures being distinct, i.e. each occurs with probability (1 − p 0 )/(N − p N 0 ) = 1/N . Therefore On the other hand, the minimum is achieved when all remaining structures are the same. Thus H min (p 0 ) = −p 0 log 2 p 0 − (1 − p 0 ) log 2 (1 − p 0 ) . Now we prove Proposition 1. Proof of Proposition 1 Let s 0 be the structure having the highest probability p 0 in ′ . By Proposition 4, we have is strictly increasing on p i,j ∈ [0, 1/2] and symmetric with respect to p i,j = 1/2 . Therefore, |P(� i,j 0 ) − P(� i,j 1 )| reaches its minimum when H (X i,j ) has the maximum value, that is, X i 0 ,j 0 . Assertion (2) follows directly from Proposition 2. Given two features X 1 and X 2 , we can partition either first by X 1 and subsequently by X 2 , or first by X 2 and then by X 1 , or just by a pair of features (X 1 , X 2 ) . In the following, we will show that all three approaches provide the same entropy reduction of . Before the proof, we define some notations. The joint probability distribution of a pair of features (X 1 , X 2 ) is given by q i 1 ,i 2 = P(X 1 = x (1) i 1 , X 2 = x (2) i 2 ) , and the marginal probability distributions are given by q (1) i 1 = P(X 1 = x (1) i 1 ) and q (2) i 2 = P(X 2 = x (2) i 2 ) . Clearly, i 1 q i 1 ,i 2 = q (2) i 2 and i 2 q i 1 ,i 2 = q (1) i 1 . 
The joint entropy $H(X_1, X_2)$ of a pair $(X_1, X_2)$ is defined as $H(X_1, X_2) = -\sum_{i_1} \sum_{i_2} q_{i_1, i_2} \log_2 q_{i_1, i_2}$. The conditional entropy $H(X_2 | X_1)$ of a feature $X_2$ given $X_1$ is defined as the expected value of the entropies of the conditional distributions of $X_2$, averaged over the conditioning feature $X_1$, i.e. $H(X_2 | X_1) = -\sum_{i_1} \sum_{i_2} q_{i_1, i_2} \log_2 \big( q_{i_1, i_2} / q^{(1)}_{i_1} \big)$. These quantities satisfy the chain rule

(4) $H(X_1, X_2) = H(X_1) + H(X_2 | X_1)$.

Proposition 6 Let $R(\Omega, X_1, X_2)$ denote the entropy reduction of $\Omega$ first by the feature $X_1$ and then by the feature $X_2$, and let $R(\Omega, (X_1, X_2))$ denote the entropy reduction of $\Omega$ by the pair of features $(X_1, X_2)$. Then

(5) $R(\Omega, X_1, X_2) = R(\Omega, (X_1, X_2))$.

Proof By Proposition 2, we have $R(\Omega, X_1) = H(X_1)$ and $R(\Omega, (X_1, X_2)) = H(X_1, X_2)$. Let $\Omega_{i_1}$ denote the spaces obtained by partitioning $\Omega$ via $X_1$, i.e. $\Omega_{i_1} = (\Omega_{i_1}, \mathcal{P}(\Omega_{i_1}), p_{i_1})$, where $\Omega_{i_1} = \{s \in \Omega : X_1(s) = x^{(1)}_{i_1}\}$ and $p_{i_1}(s) = p(s)/q^{(1)}_{i_1}$ for $s \in \Omega_{i_1}$, where $q^{(1)}_{i_1} = P(X_1 = x^{(1)}_{i_1})$. The space $\Omega_{i_1}$ is then further partitioned into $\Omega_{i_1, i_2}$ via $X_2$. That is, $\Omega_{i_1, i_2} = (\Omega_{i_1, i_2}, \mathcal{P}(\Omega_{i_1, i_2}), p_{i_1, i_2})$, where $\Omega_{i_1, i_2} = \{s \in \Omega_{i_1} : X_2(s) = x^{(2)}_{i_2}\}$ and $p_{i_1, i_2}(s) = p(s)/q_{i_1, i_2}$ for $s \in \Omega_{i_1, i_2}$. The entropy reduction $R(\Omega, X_1, X_2)$ is given by the difference between the a priori Shannon entropy $H(\Omega)$ and the conditional entropy $H((\Omega | X_1) | X_2)$, which is the expected value of the entropies of the $\Omega_{i_1, i_2}$, weighted by the probabilities $P(s \in \Omega_{i_1, i_2}) = P(X_1 = x^{(1)}_{i_1}, X_2 = x^{(2)}_{i_2}) = q_{i_1, i_2}$. In view of Proposition 2 and the chain rule (4), we derive $R(\Omega, X_1, X_2) = H(X_1) + H(X_2 | X_1) = H(X_1, X_2) = R(\Omega, (X_1, X_2))$, and Eq. (5) follows.

The maximum entropy of an arbitrary feature is achieved when all its outcomes occur with equal probability, and this maximum value equals the logarithm to the base 2 of the number of possible outcomes. Thus Proposition 2 implies that the more possible outcomes a feature has, the higher the entropy reduction it can possibly lead to. Meanwhile, a feature with an arbitrary number of outcomes can be viewed as a combination of binary features, i.e. features with two possible outcomes. Even though the entropy of a combination of two features is at least as large as that of each of them, Proposition 6 shows that partitioning the space successively by two features yields the same entropy reduction as partitioning by their combination. Therefore, instead of considering features with as many outcomes as possible, we focus on binary features.

Here we assess the improvement of the error rates obtained by repeating the same query twice. Let Y (or N) denote the event that the queried base pair is (or is not) contained in the target structure. Let y (or n) denote the event that the experiment confirms (or rejects) the base pair. Let nn denote the event that two independent experiments both reject the base pair; the events yy and yn are defined analogously. Utilizing the same sequences and structures as described in Fig. 5, we estimate the conditional probabilities $P(n|N) \approx 0.993$ and $P(n|Y) \approx 0.055$. The prior probability $P(Y)$ can be computed as the expected number $l_1$ of confirmed queried base pairs on the path, divided by the number of queries in each sample. Figure 11 displays the distribution of $l_1$, which has mean around 5; accordingly, we adopt $P(Y) = P(N) = 0.5$. By Bayes' theorem, we calculate the posterior

$P(N|nn) = \frac{P(nn|N) P(N)}{P(nn)} = \frac{P(n|N)^2 P(N)}{P(n|N)^2 P(N) + P(n|Y)^2 P(Y)}$,

where $P(nn) = P(nn|N) P(N) + P(nn|Y) P(Y)$. Since the two experiments can be assumed to be conditionally independent given Y, and also given N, we have $P(nn|N) = P(n|N)^2$ and $P(nn|Y) = P(n|Y)^2$. Similarly, we compute $P(Y|nn)$, $P(Y|yy)$, $P(Y|yn)$, etc.; see Table 7.
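The following short Python sketch reproduces this Bayesian update numerically; the variable and helper names are ours, and the numbers are the estimates quoted above.

```python
# Posterior error rates after repeating a base-pair query twice, assuming the
# two experiments are conditionally independent given Y and given N.
p_n_given_N = 0.993   # experiment rejects the pair | pair absent from target
p_n_given_Y = 0.055   # experiment rejects the pair | pair present in target
p_y_given_N = 1.0 - p_n_given_N
p_y_given_Y = 1.0 - p_n_given_Y
p_Y = p_N = 0.5       # uniform prior

def posterior_Y(likelihood_Y: float, likelihood_N: float) -> float:
    """P(Y | observed answers) via Bayes' theorem with a uniform prior."""
    joint_Y = likelihood_Y * p_Y
    joint_N = likelihood_N * p_N
    return joint_Y / (joint_Y + joint_N)

# Two rejections: residual probability that the pair is in the target after nn.
e0_2 = posterior_Y(p_n_given_Y ** 2, p_n_given_N ** 2)
# Two confirmations: residual probability that the pair is absent after yy.
e1_2 = 1.0 - posterior_Y(p_y_given_Y ** 2, p_y_given_N ** 2)
# Probability of the mixed answer n followed by y, marginalized over Y and N.
p_ny = p_n_given_Y * p_y_given_Y * p_Y + p_n_given_N * p_y_given_N * p_N

print(f"e0^[2] = {e0_2:.4f}")   # approximately 0.003
print(f"e1^[2] = {e1_2:.6f}")   # approximately 0.00005
print(f"P(ny)  = {p_ny:.4f}")   # approximately 0.03
```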
Table 7 demonstrates that, if we obtain the same answer to the query twice, the error rates become significantly smaller, for example, $e_0^{[2]} = 0.003$ and $e_1^{[2]} = 0.00005$. Mixed answers ny or yn occur with probability $P(ny) = 0.0292$, i.e., they rarely happen; in that case we recommend performing a third experiment and taking the majority of the three answers. In principle, we can reduce the error rates further by repeating the same query k times. The above Bayesian argument then generalizes to sequentially updating the error rates from $e_0$ to $e_0^{[k]}$. We can show that $e_0^{[k]}$ and $e_1^{[k]}$ approach 0 as k grows to infinity, in which case the reliability of the leaf space, $P(s \in \Omega_{11})$, equals 1.

Table 7 The posterior probabilities after two experiments. We use the same sequences and structures as described in Fig. 5.

References
1. Secondary structure of single-stranded nucleic acids
2. RNA secondary structures and their prediction
3. The equilibrium partition function and base pair binding probabilities for RNA secondary structure
4. A statistical sampling algorithm for RNA secondary structure prediction
5. Structural analysis of RNA using chemical and enzymatic probing monitored by primer extension
6. RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE)
7. Accurate SHAPE-directed RNA structure determination
8. Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots
9. Evaluating the accuracy of SHAPE-directed RNA secondary structure predictions
10. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction
11. Integrating chemical footprinting data into RNA secondary structure prediction
12. ViennaRNA Package 2.0
13. mRNAs and lncRNAs intrinsically form secondary structures with short end-to-end distances
14. The rainbow spectrum of RNA secondary structures
15. The block spectrum of RNA pseudoknot structures
16. 3S: shotgun secondary structure determination of long non-coding RNAs
17. COOLAIR antisense RNAs form evolutionarily conserved elaborate secondary structures
18. On a problem of information theory
19. Adventures of a mathematician
20. A mathematical theory of communication
21. Block coding for the binary symmetric channel with noiseless, delayless feedback
22. Searching with known error probability
23. Ulam's searching game with a fixed number of lies
24. Coping with errors in binary search procedures
25. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure
26. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure
27. On finding all suboptimal foldings of an RNA molecule
28. Comparing RNA secondary structures using a relaxed base-pair score
29. RNA base pairing complexity in living cells visualized by correlated chemical probing
30. RNA structure inference through chemical mapping after accidental or intentional mutations
31. Characterising RNA secondary structure space using information entropy
32. RNA thermodynamic structural entropy
33. Frnakenstein: multiple target inverse RNA folding
34. GenRGenS: software for generating random genomic sequences and structures
35. Combinatorics of RNA hairpins and cloverleaves
36. RNA secondary structure
37. Computation of generating functions for biological molecules
38. Spaces of RNA secondary structures
39. A glimpse of structural biology through X-ray crystallography
40. Characterizing RNA dynamics at atomic resolution using solution-state NMR spectroscopy
41. How cryo-EM is revolutionizing structural biology
42. Toward all RNA structures, concisely
43. Elements of information theory (Wiley Series in Telecommunications and Signal Processing)

We want to thank Christopher Barrett for stimulating discussions and the staff of the Biocomplexity Institute & Initiative at the University of Virginia for their great support. We would like to thank Dr. Kevin Weeks for pointing out their recent work [29]. Many thanks to Qijun He, Fenix Huang, Andrei Bura, Ricky Chen, and Reza Rezazadegan for discussions.

All authors contributed to this work. TJL derived the mathematical results. Both authors read and approved the final manuscript.

Source code of the implementation of our algorithm is available from https://github.com/GaussBackyard/RNAStructureIdentifier. The authors declare that they have no competing interests.

Appendix

Most computational approaches of RNA structure prediction reduce to a class of coarse-grained structures, the RNA secondary structures [1, 35–38]. These are contact structures, obtained by abstracting from the actual spatial arrangement of nucleotides. An RNA secondary structure can be represented as a diagram, a labeled graph over the vertex set $\{1, \ldots, n\}$ whose vertices are arranged on a horizontal line and whose arcs are drawn in the upper half-plane. Vertices correspond to nucleotides in the primary sequence, and arcs correspond to the Watson-Crick A-U and C-G base pairs as well as the wobble U-G base pairs. Two arcs $(i_1, j_1)$ and $(i_2, j_2)$ form a pseudoknot if they cross, i.e. the nucleotides appear in the order $i_1 < i_2 < j_1 < j_2$ in the primary sequence. An RNA secondary structure is a diagram without pseudoknots.

We define two distances for comparing two structures, the base-pair distance and the signature distance. The base-pair distance utilizes a representation of a secondary structure s as a bit string $b(s) = (b_1, b_2, \ldots, b_l)$, where l denotes the number of all possible base pairs and $b_k$ is a bit. Given the arc set E equipped with the lexicographic order, we define $b_k = 1$ if s contains the k-th base pair in E, and $b_k = 0$ otherwise. The base-pair distance $d_{\text{bp}}(s, s')$ between two structures s and s' is the Hamming distance between their corresponding bit strings b(s) and b(s').

The 0-1 signature (or simply signature) of a structure s is a vector $q(s) = (q_1, q_2, \ldots, q_n)$, where $q_k = 0$ when the k-th base is unpaired in s and $q_k = 1$ otherwise. The signature distance $d_{\text{sn}}(s, s')$ between two structures s and s' is defined as the Hamming distance between their corresponding 0-1 signatures q(s) and q(s'). By construction, the 0-1 signature of a secondary structure mimics its probing signals, and the signature distance measures the similarity between the probing profiles of two structures. Observing that each base pair affects the pairing status of exactly two positions, we derive $d_{\text{sn}}(s, s') \le 2\, d_{\text{bp}}(s, s')$ for any s and s'.

Computational prediction of RNA secondary structures is mainly driven by loop-based energy models [25, 26]. The key assumption of these approaches is that the free energy E(s) of an RNA secondary structure s is estimated by the sum of the energy contributions E(L) of its individual loops L, i.e. $E(s) = \sum_L E(L)$. According to thermodynamics, the free energy reflects not only the overall stability of the structure, but also its probability of appearing in thermodynamic equilibrium.
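As a small illustration of these two distances, the Python sketch below computes $d_{\text{bp}}$ and $d_{\text{sn}}$ from dot-bracket strings and checks the inequality $d_{\text{sn}} \le 2\, d_{\text{bp}}$; the helper functions and toy structures are ours, not part of the paper's implementation.

```python
# Base-pair distance and 0-1 signature distance for secondary structures given
# in dot-bracket notation (a sketch; helper names and example structures are ours).

def base_pairs(db: str) -> set[tuple[int, int]]:
    """Return the set of base pairs (i, j), 1-indexed, of a dot-bracket string."""
    stack, pairs = [], set()
    for k, ch in enumerate(db, start=1):
        if ch == '(':
            stack.append(k)
        elif ch == ')':
            pairs.add((stack.pop(), k))
    return pairs

def d_bp(s: str, t: str) -> int:
    """Hamming distance of the base-pair bit strings = size of the symmetric difference."""
    return len(base_pairs(s) ^ base_pairs(t))

def signature(db: str) -> list[int]:
    """0-1 signature: 1 if the k-th base is paired, 0 otherwise."""
    return [0 if ch == '.' else 1 for ch in db]

def d_sn(s: str, t: str) -> int:
    """Hamming distance of the 0-1 signatures."""
    return sum(a != b for a, b in zip(signature(s), signature(t)))

s1, s2 = "((...))..", "(((...)))"
print(d_bp(s1, s2))                      # 5
print(d_sn(s1, s2))                      # 4
print(d_sn(s1, s2) <= 2 * d_bp(s1, s2))  # True, as derived above
```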
The loop-based energy model leads to the Boltzmann sampling [4, 12] of secondary structures according to their equilibrium probabilities, whose computation is facilitated by the partition function [3]. In this model, the energy contribution of a base pair depends on the two adjacent loops that intersect at the base pair, see Fig. 17 (LHS). Note that, in a pseudoknot, two adjacent loops may intersect at several base pairs, and thus the energy contribution of a base pair can affect several loops, see Fig. 17 (RHS).

The basic idea of RNA structure probing is that chemical probes react differently with paired and unpaired nucleotides. More reactive regions of the RNA are likely to be single-stranded, and less reactive regions are likely to be base-paired. Thus every nucleotide in a folded RNA sequence can be assigned a reactivity score, which depends on the type of chemical or enzymatic footprinting experiment and the strength of the reactivity. It is, however, rarely certain whether or not a given position is paired.

(For $\Omega$ and $\Omega_{\text{probe}}$, we utilize the same Boltzmann samples as described in Fig. 18.)

We shall proceed by excluding $p_0 \le g(E)$. By way of contradiction, suppose that $p_0 < 0.5$ and that the structures in $\Omega'$ are arranged in descending order according to their probabilities $p_i$ for $i = 0, 1, \ldots, k$. Since each structure in $\Omega'$ has probability smaller than 0.5, the sample $\Omega'$ contains at least three distinct structures, i.e. $k \ge 2$. By construction, we have $p_i \le p_0 < 0.5$. Now we consider the optimization problem of minimizing $-\sum_{i=0}^{k} p_i \log_2 p_i$ subject to $\sum_{i=0}^{k} p_i = 1$ and $0 \le p_i \le 0.5$. We observe that this function reaches its minimum 1 only for $p_0 = p_1 = 0.5$ and $p_i = 0$ for $i \ge 2$. In the case of $p_0 < 0.5$, the minimum cannot be attained and we arrive at some $E > 1$, in contradiction to our assumption $E \le 1$. Therefore $p_0 \ge f(E)$ is the only possible scenario, i.e., $\Omega'$ contains a distinguished structure with probability at least f(E).

Here we provide proofs of the information-theoretic results on the Boltzmann ensemble of secondary structures. In particular, we point out that these results hold in the more general setup of discrete probability spaces. Let $(\Omega, \mathcal{P}(\Omega), p)$ be a discrete probability space consisting of the sample space $\Omega$, its power set $\mathcal{P}(\Omega)$ as the $\sigma$-algebra and the probability measure p. We shall refer to this space as $\Omega$. The Shannon entropy of $\Omega$ is given by $H(\Omega) = -\sum_{s \in \Omega} p(s) \log_2 p(s)$, where the units of H are bits.

A feature X is a discrete random variable defined on $\Omega$. Assume that X has a finite number of values $x_1, x_2, \ldots, x_k$ and set $q_i = P(X = x_i)$. The Shannon entropy H(X) of the feature X is given by $H(X) = -\sum_{i=1}^{k} q_i \log_2 q_i$. In particular, the values of X define a partition of $\Omega$ into disjoint subsets $\Omega_i = \{s \in \Omega : X(s) = x_i\}$, for $1 \le i \le k$. This further induces k spaces $(\Omega_i, \mathcal{P}(\Omega_i), p_i)$, where the induced distribution is given by $p_i(s) = p(s)/q_i$ for $s \in \Omega_i$, and $q_i$ denotes the probability of X having the value $x_i$, given by $q_i = \sum_{s \in \Omega_i} p(s)$.

Let $H(\Omega|X)$ denote the conditional entropy of $\Omega$ given the value of X. The entropy $H(\Omega|X)$ gives the expected value of the entropies of the conditional distributions on $\Omega$, averaged over the conditioning feature X, and can be computed by $H(\Omega|X) = \sum_{i=1}^{k} q_i H(\Omega_i)$. The entropy reduction $R(\Omega, X)$ of $\Omega$ for the feature X is then the difference between the a priori Shannon entropy $H(\Omega)$ and the conditional entropy $H(\Omega|X)$, i.e. $R(\Omega, X) = H(\Omega) - H(\Omega|X)$. The entropy reduction indicates the average change in information entropy from a prior state to a state that takes some information as given. Now we prove Propositions 2 and 3.