key: cord-0043821-h2ip2jqf authors: Fleischmann, Pamela; Lejeune, Marie; Manea, Florin; Nowotka, Dirk; Rigo, Michel title: Reconstructing Words from Right-Bounded-Block Words date: 2020-05-26 journal: Developments in Language Theory DOI: 10.1007/978-3-030-48516-0_8 sha: 35733587400865fb17279825842281bf35b85b1c doc_id: 43821 cord_uid: h2ip2jqf

A reconstruction problem of words from scattered factors asks for the minimal information, like multisets of scattered factors of a given length or the number of occurrences of scattered factors from a given set, necessary to uniquely determine a word. We show that a word $w \in \{a, b\}^{*}$ can be reconstructed from the number of occurrences of at most $\min(|w|_a, |w|_b) + 1$ scattered factors of the form $a^{i} b$, where $|w|_a$ is the number of occurrences of the letter $a$ in $w$. Moreover, we generalize the result to alphabets of the form $\{1, \ldots, q\}$ by showing that at most $\sum_{i=1}^{q-1} |w|_i (q - i + 1)$ scattered factors suffice to reconstruct $w$. Both results improve on the upper bounds known so far. Complexity time bounds on reconstruction algorithms are also considered here.

The general scheme for a so-called reconstruction problem is the following one: given a sufficient amount of information about substructures of a hidden discrete structure, can one uniquely determine this structure? In particular, which fragments of the structure are needed to recover it entirely? For instance, a square matrix of size at least 5 can be reconstructed from its principal minors given in any order [20]. In graph theory, given some subgraphs of a graph (these subgraphs may share some common vertices and edges), can one uniquely rebuild the original graph? Given a finite undirected graph $G = (V, E)$ with $n$ vertices, consider the multiset made of the $n$ induced subgraphs of $G$ obtained by deleting exactly one vertex from $G$. In particular, one knows how many isomorphic subgraphs of a given class appear.
Two graphs leading to the same multiset (generally called a deck) are said to be hypomorphic. A conjecture due to Kelly and Ulam states that two hypomorphic graphs with at least three vertices are isomorphic [14, 21]. A similar conjecture in terms of edge-deleted subgraphs has been proposed by Harary [11]. These conjectures are known to hold true for several families of graphs. A finite word, i.e., a finite sequence of letters of some given alphabet, can be seen as an edge- or vertex-labeled linear tree. So variants of the graph reconstruction problem can be considered and are of independent interest. Participants of the Oberwolfach meeting on Combinatorics on Words in 2010 [2] gave a list of 18 important open problems in the field. Amongst them, the twelfth problem is stated as reconstruction from subwords of given length. In the following statement and all along the paper, a subword of a word is understood as a subsequence of not necessarily contiguous letters from this word, i.e., subwords can be obtained by deleting letters from the given word. To highlight this latter property, they are often called scattered subwords or scattered factors, which is the notion we are going to use.

Definition 1. Let $k, n$ be natural numbers. Words of length $n$ over a given alphabet are said to be $k$-reconstructible whenever the multiset of scattered factors of length $k$ (or $k$-deck) uniquely determines any word of length $n$.

Notice that the definition requires multisets in order to record how often a scattered factor occurs in the words. For instance, the scattered factor $ba$ occurs three times in $baba$, which provides more information for the reconstruction than the mere fact that $ba$ is a scattered factor. The challenge is to determine the function $f(n) = k$ where $k$ is the least integer for which words of length $n$ are $k$-reconstructible. This problem has been studied by several authors and one of the first traces goes back to 1973 [13].
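To make the counting behind Definition 1 concrete, the following sketch computes the number of occurrences of a scattered factor with a standard dynamic program and builds a $k$-deck by brute force. The function names `scattered_count` and `k_deck` are illustrative assumptions and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def scattered_count(w: str, u: str) -> int:
    """Number of occurrences of u as a scattered factor (subsequence) of w."""
    # ways[j] = number of embeddings of the prefix u[:j] into the part of w read so far
    ways = [1] + [0] * len(u)
    for letter in w:
        # scan right to left so that one letter of w extends each embedding at most once
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[len(u)]


def k_deck(w: str, k: int) -> Counter:
    """Multiset of all length-k scattered factors of w (brute force, small inputs only)."""
    return Counter(
        "".join(w[i] for i in positions)
        for positions in combinations(range(len(w)), k)
    )


print(scattered_count("baba", "ba"))  # 3, as in the example above
print(k_deck("baba", 2))
```

The brute-force deck is exponential in general and is only meant to show what the multiset contains; the dynamic program runs in time proportional to $|w| \cdot |u|$.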
Results in that direction have been obtained by M.-P. Schützenberger (with the so-called Schützenberger's Guessing game) and L. Simon [25]. They show that words of length $n$ sharing the same multiset of scattered factors of length up to $\lfloor n/2 \rfloor + 1$ are the same. Consequently, words of length $n$ are $(\lfloor n/2 \rfloor + 1)$-reconstructible. In [15], this upper bound has been improved: Krasikov and Roditty have shown that words of length $n$ are $k$-reconstructible for $k \geq \lfloor \frac{16}{7}\sqrt{n} \rfloor + 5$. Bounds were also considered in [19]. Algorithmic complexity of the reconstruction problem is discussed, for instance, in [5]. Note that the different types of reconstruction problems have applications in phylogenetic networks, see, e.g., [12], or in the context of molecular genetics [7] and coding theory [16].

Another motivation, close to combinatorics on words, stems from the study of $k$-binomial equivalence of finite words and $k$-binomial complexity of infinite words (see [23] for more details). Two words of the same length are $k$-binomially equivalent if they have the same multiset of scattered factors of length $k$, also known as the $k$-spectrum ([1, 18, 24]). Given two words $x$ and $y$ of the same length, one can address the following problem: decide whether or not $x$ and $y$ are $k$-binomially equivalent. A polynomial time decision algorithm based on automata and a probabilistic algorithm have been given in [10]. A variation of our work would be to find, given $k$ and $n$, a minimal set of scattered factors for which the knowledge of the number of occurrences in $x$ and $y$ permits to decide $k$-binomial equivalence. Over an alphabet of size $q$, there are $q^{k}$ pairwise distinct length-$k$ factors. If we relax the requirement of only considering scattered factors of the same length, another interesting question is to look for a minimal (in terms of cardinality) multiset of scattered factors to reconstruct entirely a word. Let the binomial coefficient $\binom{u}{x}$ be the number of occurrences of $x$ as a scattered factor of $u$. The general problem addressed in this paper is therefore the following one.
Problem 2. Let $\Sigma$ be a given alphabet and $n$ a natural number. We want to reconstruct a hidden word $w \in \Sigma^{n}$. To that aim, we are allowed to pick a word $u_i$ and ask questions of the type "What is the value of $\binom{w}{u_i}$?". Based on the answers to questions related to $\binom{w}{u_1}, \ldots, \binom{w}{u_i}$, we can decide which will be the next question (i.e., decide which word will be $u_{i+1}$). We want to have the shortest sequence $(u_1, \ldots, u_k)$ uniquely determining $w$ by knowing the values of $\binom{w}{u_1}, \ldots, \binom{w}{u_k}$. We naturally look for a value of $k$ less than the upper bound for $k$-reconstructibility.

In this paper, we firstly recall the use of Lyndon words in the context of reconstructibility. A word $w$ over a totally ordered alphabet is called a Lyndon word if it is the lexicographically smallest amongst all its rotations, i.e., $w = xy$ is smaller than $yx$ for all non-trivial factorisations $w = xy$. Every binomial coefficient $\binom{w}{x}$ for arbitrary words $w$ and $x$ over the same alphabet can be deduced from the values of the coefficients $\binom{w}{u}$ for Lyndon words $u$ that are lexicographically less than or equal to $x$. This result is presented in Sect. 2 along with the basic definitions. We consider an alphabet equipped with a total order on the letters. Words of the form $a^{n} b$ with letters $a < b$ and a natural number $n$ are a special form of Lyndon words, the so-called right-bounded-block words. We consider the reconstruction problem from the information given by the occurrences of right-bounded-block words as scattered factors of a word of length $n$. In Sect. 3 we show how to reconstruct a word uniquely from $m + 1$ binomial coefficients of right-bounded-block words, where $m$ is the minimum of the numbers of occurrences of $a$ and of $b$ in the word. We also prove that this is less than the upper bound given in [15]. In Sect. 4 we reduce the problem for arbitrary finite alphabets $\{1, \ldots, q\}$ to the binary case.
Here we show that at most $\sum_{i=1}^{q-1} |w|_i (q - i + 1) \leq q|w|$ binomial coefficients suffice to uniquely reconstruct $w$, with $|w|_i$ being the number of occurrences of letter $i$ in $w$. Again, we compare this bound to the best known one for the classical reconstruction problem (from words of a given length). In the last section of the paper we also propose several results of algorithmic nature regarding the efficient reconstruction of words from given scattered factors. Due to space restrictions some proofs (marked with $*$) can be found in [9].

Let $\mathbb{N}$ be the set of natural numbers, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, and let $\mathbb{N}_{\geq k}$ be the set of all natural numbers greater than or equal to $k$. An alphabet $\Sigma = \{a, b, c, \ldots\}$ is a finite set of letters and a word is a finite sequence of letters. We let $\Sigma^{*}$ denote the set of all finite words over $\Sigma$. The empty word is denoted by $\varepsilon$ and $\Sigma^{+}$ is the free semigroup $\Sigma^{*} \setminus \{\varepsilon\}$. The length of a word $w$ is denoted by $|w|$. Let $\Sigma^{\leq k} := \{w \in \Sigma^{*} \mid |w| \leq k\}$ and let $\Sigma^{k}$ be the set of all words of length exactly $k \in \mathbb{N}$. The number of occurrences of a letter $a \in \Sigma$ in a word $w \in \Sigma^{*}$ is denoted by $|w|_a$. The $i$-th letter of a word $w$ is given by $w[i]$ for $i \in [|w|]$. The powers of $w \in \Sigma^{*}$ are defined recursively by $w^{0} = \varepsilon$, $w^{n} = w w^{n-1}$ for $n \in \mathbb{N}$. A word $u \in \Sigma^{*}$ is a factor of $w \in \Sigma^{*}$ if $w = xuy$ holds for some words $x, y \in \Sigma^{*}$. Moreover, $u$ is a prefix of $w$ if $x = \varepsilon$ holds and a suffix if $y = \varepsilon$ holds. The factor of $w$ from the $i$-th to the $j$-th letter will be denoted by $w[i..j]$. Additional basic information about combinatorics on words can be found in [17]. For words $w, u \in \Sigma^{*}$, define $\binom{w}{u}$ as the number of occurrences of $u$ as a scattered factor of $w$. The following definition addresses Problem 2. is also the 2-vector of binomial coefficients of $baab$. On the other hand $S = \{a, ab, ab^{2}\}$ reconstructs $w$ uniquely. The following remark gives immediate results for binary alphabets. Let $\Sigma = \{a, b\}$ and $w \in \Sigma^{n}$.
If $|w|_a \in \{0, n\}$ then $w$ contains either only $b$ or only $a$ and, by the given length $n$ of $w$, $w$ is uniquely determined by $S = \{a\}$. This fact is in particular an equivalence: $w \in \Sigma^{n}$ can be uniquely determined by $\{a\}$ iff $|w|_a \in \{0, n\}$. If $|w|_a \in \{1, n - 1\}$, $w$ is not uniquely determined by $\{a\}$, as witnessed by $ab$ and $ba$ for $n = 2$. It is immediately clear that the additional information $\binom{w}{ab}$ leads to unique determinism of $w$.

Lyndon words play an important role regarding the reconstruction problem. As shown in [22], only scattered factors which are Lyndon words are necessary to determine a word uniquely, i.e., $S$ can always be assumed to be a set of Lyndon words. To obtain a formula to compute the binomial coefficient w , and k ∈ N the definitions of shuffle and infiltration are necessary [17]. where the sum has to be taken over all pairs $(I_1, I_2)$ of sets that are partitions of $[n]$ such that The infiltration is a variant of the shuffle in which equal letters can be merged.

Definition 11. Let $n_1, n_2 \in \mathbb{N}$, $u_1 \in \Sigma^{n_1}$, and $u_2 \in \Sigma^{n_2}$. Set $n = n_1 + n_2$. The infiltration of $u_1$ and $u_2$ is the polynomial $u_1 \downarrow u_2 = \sum_{I_1, I_2} w(I_1, I_2)$, where the sum has to be taken over all pairs $(I_1, I_2)$ of sets of cardinality $n_1$ and $n_2$ respectively, for which the union is equal to the set $[n']$ for some $n' \leq n$. Words $w(I_1, I_2)$ are defined as in the previous definition. Note that some w(I 1 , In that case they do not appear in the previous sum. Considering for instance $u_1 = aba$ and $u_2 = ab$ gives the polynomials Based on Definitions 10 and 11, we are able to give a formula to compute a binomial coefficient from the ones making use of Lyndon words. This formula is given implicitly in [22, Theorem 6.4]: Let $u \in \Sigma^{*}$ be a non-Lyndon word. By [22, Corollary 6.2] there exist non-empty words $x, y \in \Sigma^{*}$ with $u = xy$ such that every word appearing in the shuffle polynomial of $x$ and $y$ is lexicographically less than or equal to $u$.
Then, for all words $w \in \Sigma^{*}$, we have a formula for $\binom{w}{u}$ in which $(P, v)$ denotes the coefficient of the word $v$ in the polynomial $P$. One may apply this formula recursively until only Lyndon factors are considered. Some examples can be found in [9].

In this section we present a method to reconstruct a binary word uniquely from binomial coefficients of right-bounded-block words. Let $n \in \mathbb{N}$ be a natural number and $w \in \{a, b\}^{n}$ a word. Since the word length $n$ is assumed to be known, $|w|_a$ is known if $|w|_b$ is given and vice versa. For abbreviation, set $k_u = \binom{w}{u}$ for $u \in \Sigma^{*}$. Moreover we assume w.l.o.g. $k_a \leq k_b$ and that $k_a$ is known (otherwise substitute each $a$ by $b$ and each $b$ by $a$, apply the following reconstruction method and revert the substitution). This implies that $w$ is of the form $w = b^{s_1} a b^{s_2} a \cdots a b^{s_{k_a+1}}$ with $s_i \in \mathbb{N}_0$ and $\sum_{i \in [k_a+1]} s_i = n - k_a$. Counting the occurrences of $a^{\ell} b$ in this form gives, for $\ell \in [k_a]_0$,

$$k_{a^{\ell} b} = \sum_{i=\ell+1}^{k_a+1} \binom{i-1}{\ell} s_i. \qquad (2)$$

Remark 12. For the coefficients $c_i = \binom{i-1}{\ell}$ of Eq. (2) we have $c_i < c_{i+1}$ (for $\ell \geq 1$) and especially $c_{\ell+1} = 1$ and $c_{\ell+2} = \ell + 1$.

Equation (2) shows that reconstructing a word uniquely from binomial coefficients of right-bounded-block words amounts to solving a system of Diophantine equations. The knowledge of $k_b, \ldots, k_{a^{\ell} b}$ provides $\ell + 1$ equations. If the equation of $k_{a^{\ell} b}$ has a unique solution for $\{s_{\ell+1}, \ldots, s_{k_a+1}\}$ (in this case we say, by abuse of language, that $k_{a^{\ell} b}$ is unique), then the system in row echelon form has a unique solution and thus the binary word is uniquely reconstructible. Notice that $k_{a^{k_a} b}$ is always unique since $k_{a^{k_a} b} = s_{k_a+1}$.

Consider $n = 10$ and $k_a = 4$. This leads to $w = b^{s_1} a b^{s_2} a b^{s_3} a b^{s_4} a b^{s_5}$ with $\sum_{i \in [5]} s_i = 6$. Given $k_{ab} = 4$ we get $4 = s_2 + 2 s_3 + 3 s_4 + 4 s_5$. The $s_i$ are not uniquely determined. If $k_{a^{2} b} = 2$ is also given, we obtain the equation $2 = s_3 + 3 s_4 + 6 s_5$ and thus $s_3 = 2$ and $s_4 = s_5 = 0$ is the only solution. Substituting these results in the previous equation leads to $s_2 = 0$ and, since we only have six $b$, we get $s_1 = 4$. Hence $w = b^{4} a^{2} b^{2} a^{2}$ is uniquely reconstructed by $S = \{a, ab, a^{2} b\}$. The following definition captures all solutions for the equation defined by $k_{a^{\ell} b}$ for $\ell \in [k_a]_0$.
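The worked example with $n = 10$ and $k_a = 4$ can be checked by exhaustive search. The sketch below enumerates every binary word with the prescribed number of $a$'s and keeps those whose binomial coefficients match the given values. The helper names are hypothetical, and brute-force enumeration is only an illustration for small $n$, not the reconstruction method of the paper.

```python
from itertools import combinations


def scattered_count(w: str, u: str) -> int:
    """Number of occurrences of u as a scattered factor (subsequence) of w."""
    ways = [1] + [0] * len(u)
    for letter in w:
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[len(u)]


def candidates(n: int, k_a: int, constraints: dict) -> list:
    """All words in {a, b}^n with k_a letters a whose coefficients match `constraints`."""
    found = []
    for a_positions in combinations(range(n), k_a):
        chosen = set(a_positions)
        w = "".join("a" if i in chosen else "b" for i in range(n))
        if all(scattered_count(w, u) == value for u, value in constraints.items()):
            found.append(w)
    return found


only_ab = candidates(10, 4, {"ab": 4})
with_aab = candidates(10, 4, {"ab": 4, "aab": 2})
print(len(only_ab))  # 5: k_ab = 4 alone does not determine the word
print(with_aab)      # ['bbbbaabbaa'], i.e. b^4 a^2 b^2 a^2
```

The five candidates for $k_{ab} = 4$ correspond to the five solutions of $4 = s_2 + 2 s_3 + 3 s_4 + 4 s_5$ in non-negative integers; adding $k_{a^{2} b} = 2$ leaves a single word.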
By Remark 12 the coefficients of each equation of the form (2) are strictly increasing. The next lemma provides the range each $k_{a^{\ell} b}$ may take under the constraint and c 1 , . . . , c k+1 , s 1 Proof. The case $k = 0$ is trivial. Consider the case $n = k$, i.e., This implies $\sum_{i=j}^{k} c_i s_i \geq c_{k+1}$. Since the coefficients are strictly increasing we get $\sum_{i=j}^{k} c_i s_i \leq c_k \sum_{i=j}^{k} s_i < c_{k+1}$, hence the contradiction. Proof. It follows directly from Eq. (2) and Lemma 14. The following lemma shows some cases in which $k_{a^{\ell} b}$ is unique. Proof. Consider firstly $k_{a^{\ell} b} \in [\ell]_0$. By Remark 12 we have $c_{\ell+1} = 1$ and $c_{\ell+2} = \ell + 1$. By $c_i < c_{i+1}$ we obtain immediately $s_i = 0$ for $i \in [k_a + 1] \setminus [\ell + 1]$. By setting $s_{\ell+1} = k_{a^{\ell} b}$ the claim is proven. If $k_{a^{\ell} b} = \binom{k_a}{\ell}(n - k_a)$, then $s_{k_a+1} = n - k_a$ and $s_i = 0$ for $i \in [k_a]_0$ is the only possibility. Secondly, let $r \in [k_b]_0$ and $k_{a^{\ell} b} = \binom{k_a - 1}{\ell} r + \binom{k_a}{\ell}(n - k_a - r)$ and suppose that $k_{a^{\ell} b}$ is not unique. This (we only have r occurrences of b left to distribute). By r > r we have (ka−1)!(kar − r) Since we are not able to fully characterise the uniquely determined values for each $k_{a^{\ell} b}$ for arbitrary $n$ and $\ell$, the following proposition gives the characterisation for $\ell \in \{0, 1\}$. Notice that we use $k_a$ immediately since it is determinable by $n$ and $k_{a^{0} b} = k_b$.

Proposition 17. The word $w \in \Sigma^{n}$ is uniquely determined by $k_a$ and $k_{ab}$ iff one of the following occurs:
- $k_a = 0$ or $k_a = n$ (and obviously $k_{ab} = 0$),
- $k_a = 1$ or $k_a = n - 1$ and $k_{ab}$ is arbitrary,
- $k_a \in [n - 2]_{\geq 2}$ and $k_{ab} \in \{0, 1, k_a (n - k_a) - 1, k_a (n - k_a)\}$.

In all cases not covered by Proposition 17 the word cannot be uniquely determined by $\binom{w}{a}$ and $\binom{w}{ab}$. The following theorem combines the reconstruction of a word with the binomial coefficients of right-bounded-block words. uniquely determined by $\{b, ab, a^{2} b, \ldots, a^{j} b\}$. Proof. If $k_{a^{j} b}$ is unique, the coefficients $s_{j+1}, \ldots, s_{k_a+1}$ are uniquely determined. Substituting backwards the known values in the first $j - 1$ equations (2) (for $\ell = 1, \ldots$
, $j - 1$) we can now obtain successively the values for $s_j, \ldots, s_1$. Let $\ell$ be minimal such that $k_{a^{\ell} b}$ is unique. Then $w$ is uniquely determined by $\{a, ab, a^{2} b, \ldots, a^{\ell} b\}$ and not uniquely determined by any $\{a, ab, a^{2} b, \ldots, a^{i} b\}$ for $i < \ell$. Proof. It follows directly from Theorem 18.

By [15] an upper bound on the number of binomial coefficients to uniquely reconstruct the word $w \in \Sigma^{n}$ is given by the number of binomial coefficients of the $(\lfloor \frac{16}{7}\sqrt{n} \rfloor + 5)$-spectrum. Notice that implicitly the full spectrum is assumed to be known. As proven in Sect. 2, Lyndon words up to this length suffice. Since there are $\frac{1}{n} \sum_{d \mid n} \mu(d) \cdot 2^{n/d}$ Lyndon words of length $n$, the combination of both results presented in [15, 22] states that, for $n > 6$,

$$\sum_{i=1}^{\lfloor \frac{16}{7}\sqrt{n} \rfloor + 5} \frac{1}{i} \sum_{d \mid i} \mu(d) \cdot 2^{i/d} \qquad (3)$$

binomial coefficients are sufficient for a unique reconstruction, where $\mu$ is the Möbius function. Up to now, it was the best known upper bound. Theorem 18 shows that $\min\{k_a, k_b\} + 1$ binomial coefficients are enough for reconstructing a binary word uniquely. By Proposition 17 we need exactly one binomial coefficient if $n \in [3]$ and at most two if $n = 4$. For $n \in \{5, 6\}$ we need at most $n - 2$ different binomial coefficients. The following theorem shows that by Theorem 18 we need strictly fewer binomial coefficients for $n > 6$.

Theorem 20 ($*$). Let $w \in \Sigma^{n}$. We have that $\min\{k_a, k_b\} + 1$ binomial coefficients suffice to uniquely reconstruct $w$. If $k_a \leq k_b$, then the set of sufficient binomial coefficients is $S = \{b, ab, a^{2} b, \ldots, a^{h} b\}$ where $h = \lfloor n/2 \rfloor$. If $k_a > k_b$, then the set is $S = \{a, ba, b^{2} a, \ldots, b^{h} a\}$. This bound is strictly smaller than (3).

Remark 21. By Lemma 16 we know that $k_{a^{\ell} b}$ is unique if it is in $[\ell]_0$ or exactly $\binom{k_a}{\ell}(n - k_a)$. The probability for the latter is $\frac{1}{2^{n}}$ for $w \in \{a, b\}^{n}$. If $k_{a^{\ell} b} = m \in [\ell]_0$ we get by (2) immediately $s_{\ell+1} = m$ and $s_i = 0$ for $\ell + 2 \leq i \leq k_a + 1$. Hence, the values for $s_j$ for $j \in [\ell]$ are not determined.
possibilities to fulfill the constraints, i.e., we have a probability of $\frac{d}{2^{n}}$ to have such a word.

In this section we address the problem of reconstructing words over arbitrary alphabets from their scattered factors. We begin with a series of results of algorithmic nature. Let $\Sigma = \{a_1, \ldots, a_q\}$ be an alphabet equipped with the ordering $a_i < a_j$ for $1 \leq i < j \leq q \in \mathbb{N}$. Let $w_1, \ldots, w_k \in \Sigma^{*}$ for $k \in \mathbb{N}$, and $K = (k_a)_{a \in \Sigma}$ a sequence of $|\Sigma|$ natural numbers. A $K$-valid marking of $w_1, \ldots, w_k$ is a mapping ψ : For instance, let $k = 2$, $\Sigma = \{a, b\}$, and $w_1 = aab$, $w_2 = abb$. Let $k_a = 3$, $k_b = 2$ define the sequence $K$. A $K$-valid marking of $w_1, w_2$ would be $w_1^{\psi} = (a)_1 (a)_3 (b)_1$, $w_2^{\psi} = (a)_2 (b)_1 (b)_2$, defining $\psi$ implicitly by the indices. We used parentheses in the marking of the letters in order to avoid confusion.

We recall that a topological sorting of a directed graph $G = (V, E)$, with $V = \{v_1, \ldots, v_n\}$, is a linear ordering $v_{\sigma(1)} < v_{\sigma(2)} < \ldots < v_{\sigma(n)}$ of the nodes, defined by the permutation $\sigma : [n] \to [n]$, such that there exists no edge in $E$ from $v_{\sigma(i)}$ to $v_{\sigma(j)}$ for any $i > j$ (i.e., if $v_a$ comes after $v_b$ in the linear ordering, for some $a = \sigma(i)$ and $b = \sigma(j)$, then we have $i > j$ and there should be no edge from $v_a$ to $v_b$). It is a folklore result that any directed graph $G$ has a topological sorting if and only if $G$ is acyclic.

Definition 23. Let $w_1, \ldots, w_k \in \Sigma^{*}$ for $k \in \mathbb{N}$, $K = (k_a)_{a \in \Sigma}$ a sequence of $|\Sigma|$ natural numbers, and $\psi$ a $K$-valid marking of $w_1, \ldots, w_k$. Let $G_{\psi}$ be the graph that has $\sum_{a \in \Sigma} k_a$ nodes, labelled with the letters $(a)_1, \ldots, (a)_{k_a}$, for all $a \in \Sigma$, and the directed edges $(w_i^{\psi}[j], w_i^{\psi}[j+1])$, for all $i \in [k]$ and $j \in [|w_i| - 1]$, and $((a)_i, (a)_{i+1})$, for all occurring $i$ and $a \in \Sigma$. We say that there exists a valid topological sorting of the $\psi$-marked letters of the words $w_1, \ldots, w_k$ if there exists a topological sorting of the nodes of $G_{\psi}$, i.e., $G_{\psi}$ is a directed acyclic graph.
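Definition 23 can be illustrated on the marked words $w_1^{\psi}$ and $w_2^{\psi}$ from the example above. The sketch assumes that the edges of $G_{\psi}$ join consecutive marked letters inside each word and $(a)_i$ to $(a)_{i+1}$ for each letter $a$. The function names and the representation of a marked letter as a pair `(letter, index)` are assumptions made for illustration. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which raises `CycleError` when the graph contains a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def marking_graph(marked_words, counts):
    """Map every node (letter, index) of G_psi to the set of its predecessors."""
    preds = {(a, i): set() for a, k in counts.items() for i in range(1, k + 1)}
    # edges between consecutive marked letters of each word
    for word in marked_words:
        for left, right in zip(word, word[1:]):
            preds[right].add(left)
    # edges (a)_i -> (a)_{i+1} keep equal letters in index order
    for a, k in counts.items():
        for i in range(1, k):
            preds[(a, i + 1)].add((a, i))
    return preds


def word_from_marking(marked_words, counts):
    """Read a word off one topological sorting; return None if G_psi has a cycle."""
    sorter = TopologicalSorter(marking_graph(marked_words, counts))
    try:
        order = list(sorter.static_order())
    except CycleError:
        return None
    return "".join(letter for letter, _ in order)


w1 = [("a", 1), ("a", 3), ("b", 1)]  # aab marked as (a)_1 (a)_3 (b)_1
w2 = [("a", 2), ("b", 1), ("b", 2)]  # abb marked as (a)_2 (b)_1 (b)_2
print(word_from_marking([w1, w2], {"a": 3, "b": 2}))  # aaabb
```

For this marking the graph has six edges and a single topological sorting, so the word read off the sorting contains both $aab$ and $abb$ as scattered factors. In general `static_order` returns only one of possibly several sortings, so the sketch does not test uniqueness.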
The graph associated with the $K$-valid marking of $w_1, w_2$ from above would have the five nodes $(a)_1, (a)_2, (a)_3, (b)_1, (b)_2$ and the six directed edges $((a)_1, (a)_2)$, $((a)_2, (a)_3)$, $((b)_1, (b)_2)$, $((a)_1, (a)_3)$, $((a)_3, (b)_1)$, $((a)_2, (b)_1)$ (where the direction of the edge is from the left node to the right node of the pair defining it). This graph has the topological sorting $(a)_1 (a)_2 (a)_3 (b)_1 (b)_2$.

Theorem 24. For $w_1, \ldots, w_k \in \Sigma^{*}$ and a sequence $K = (k_a)_{a \in \Sigma}$ of $|\Sigma|$ natural numbers, there exists a word $w$ such that $w_i$ is a scattered factor of $w$ with $|w|_a = k_a$, for all $i \in [k]$ and all $a \in \Sigma$, if and only if there exist a $K$-valid marking $\psi$ of the words $w_1, \ldots, w_k$ and a valid topological sorting of the $\psi$-marked letters of the words $w_1, \ldots, w_k$.

Next we show that in Theorem 24 uniqueness propagates in the $\Leftarrow$-direction. Proof. Let $w$ be the word obtained by writing in order the letters of the unique valid topological sorting of the $\psi$-marked letters of the words $w_1, \ldots, w_k$ and removing their markings. It is clear that $w$ has $w_i$ as a scattered factor, for all $i \in [k]$, and that $|w|_a = k_a$, for all $a \in \Sigma$. The word $w$ is uniquely defined (as there is no other $K$-valid marking nor valid topological sorting of the $\psi$-marked letters).

In order to state the second result, we need the projection $\pi_S(w)$ of a word $w \in \Sigma^{*}$ on $S \subseteq \Sigma$: $\pi_S(w)$ is obtained from $w$ by removing all letters from $\Sigma \setminus S$. Then there exists at most one $w \in \Sigma^{*}$ such that $w_{a,b}$ is $\pi_{\{a,b\}}(w)$ for all $a, b \in \Sigma$. Proof. Notice firstly $|W| = \frac{q(q-1)}{2}$. Let $k_a = |w_{a,b}|_a$, for $a < b \in \Sigma$. These numbers are clearly well defined, by the second item in our hypothesis. Let $K = (k_a)_{a \in \Sigma}$. It is immediate that there exists a unique $K$-valid marking $\psi$ of the words (w a,b ) a 0.

Coming now back to combinatorial results, we use the method developed in Sect. 3 to reconstruct a word over an arbitrary alphabet.
We show that we need at most $\sum_{i \in [q]} |w|_i (q + 1 - i)$ different binomial coefficients to reconstruct $w$ uniquely for the alphabet $\Sigma = \{1, \ldots, q\}$. In fact, following the results from the first part of this section, we apply this method on all combinations of two letters. Consider for an example that for $w \in \{a, b, n\}^{6}$ the following binomial coefficients are given: $\binom{w}{a^{0} b} = 1$, $\binom{w}{a^{0} n} = 2$, $\binom{w}{a^{1} b} = 0$, $\binom{w}{a^{1} n} = 3$, $\binom{w}{b^{1} n} = 2$, and $\binom{w}{a^{2} n} = 1$. By $|w| = 6$, $|w|_b = 1$, and $|w|_n = 2$, we get $|w|_a = 3$. Applying the method from Sect. 3 for $\{a, b\}$, $\{a, n\}$, and $\{b, n\}$ we obtain the scattered factors $ba^{3}$, $anana$, and $bn^{2}$. Combining all these three scattered factors gives us uniquely $banana$. Notice that in this example we only needed six binomial coefficients instead of ten, which is the worst case.

Remark 30. As seen in the example we have not only the word length but also $\binom{w}{x}$ for all $x \in \Sigma$ but one. Both pieces of information give us the remaining single-letter binomial coefficient and hence we will assume that we know all of them.

For convenience in the following theorem consider $\Sigma = \{1, \ldots, q\}$ for $q > 2$ and set $\alpha := \lfloor \frac{16}{7}\sqrt{n} \rfloor + 5$. In the general case the results by [22] and [15] yield that is smaller than the best known upper bound on the number of binomial coefficients sufficient to reconstruct a word uniquely. The following theorem generalises Theorem 20 to an arbitrary alphabet.

Theorem 31 ($*$). For uniquely reconstructing a word $w \in \Sigma^{*}$ of length at least $q - 1$, $\sum_{i \in [q]} |w|_i (q + 1 - i)$ binomial coefficients suffice, which is strictly smaller than (4).

Remark 32. Since the estimation in Theorem 31 depends on the distribution of the letters, in contrast to the method of reconstruction, it is wise to choose an order $<$ on $\Sigma$ such that $x < y$ if $|w|_x \leq |w|_y$. In the example we have chosen the natural order $a < b < n$ which leads in the worst case to fourteen binomial coefficients that have to be taken into consideration.
If we choose the order $b < n < a$, the formula from Theorem 31 provides that ten binomial coefficients suffice. This observation also shows that fewer binomial coefficients suffice for a unique determinism if the letters are not distributed equally but some letters occur very often and some only a few times.

Remark 33. Note that the number of binomial coefficients we need is at most $qn$. Indeed, we will prove that $\sum_{i \in [q]} |w|_i (q + 1 - i) \leq qn$. We have
$$qn = qn + n - n = q \sum_{i \in [q]} |w|_i + \sum_{i \in [q]} |w|_i - \sum_{i \in [q]} |w|_i \geq q \sum_{i \in [q]} |w|_i + \sum_{i \in [q]} |w|_i - \sum_{i \in [q]} i\,|w|_i = \sum_{i \in [q]} |w|_i (q + 1 - i).$$

In this paper we have proven that relaxing the reconstruction problem from scattered factors, from $k$-spectra to arbitrary sets, yields that fewer scattered factors than the best known upper bound are sufficient to reconstruct a word uniquely. Not only in the binary but also in the general case the distribution of the letters plays an important role: in the binary case the number of necessary binomial coefficients is smaller the larger $|w|_a - |w|_b$ is. The same observation results from the general case: if all letters are equally distributed in $w$ then we need more binomial coefficients than in the case where some letters rarely occur and others occur much more often. Nevertheless the restriction to right-bounded-block words (that are intrinsically Lyndon words) shows that a word can be reconstructed by fewer binomial coefficients if scattered factors from different spectra are taken. Further investigations may lead in two directions: firstly, a better characterisation of the uniqueness of the $k_{a^{\ell} b}$ would help to understand in which cases fewer than the worst-case number of binomial coefficients suffice, and secondly, other sets than the right-bounded-block words could be investigated for the reconstruction problem.

[1] Combinatorics on words - a tutorial
[2] Mini-workshop: combinatorics on words. Oberwolfach Rep
[3] Fine-grained complexity theory (tutorial)
[4] Introduction to Algorithms
[5] Reconstructing words from subwords in linear time
[6] Reconstruction from subsequences
[7] Subwords in reverse-complement order
[8] Irreducible polynomial modulo p, Bachelor thesis at
[9] Reconstructing words from right-bounded-block words
[10] Testing k-binomial equivalence
[11] On the reconstruction of a graph from a collection of subgraphs
[12] Leaf-reconstructibility of phylogenetic networks
[13] The reconstruction of a word from fragments
[14] A congruence theorem for trees
[15] On a reconstruction problem for sequences
[16] On perfect codes in deletion and insertion metric
[17] Combinatorics on Words
[18] Characterization of a word by its subwords
[19] Reconstruction of sequences
[20] On reconstruction of matrices
[21] Ulam's conjecture and graph reconstructions
[22] Free Lie algebras
[23] Another generalization of abelian equivalence: binomial complexity of infinite words
[24] Handbook of Formal Languages (3 volumes)
[25] Piecewise testable events