key: cord-0035978-974e8vma authors: Ilie, Lucian; Tinta, Liviu; Popescu, Cristian; Hill, Kathleen A. title: Viral Genome Compression date: 2006 journal: DNA Computing DOI: 10.1007/11925903_9 sha: 74fe055f38956eaba24c22fbe093a9517d32a4b9 doc_id: 35978 cord_uid: 974e8vma

Viruses compress their genome to reduce space. One of the main techniques is overlapping genes. We model this process by the shortest common superstring problem, that is, we look for the shortest genome which still contains all genes. We give an algorithm for computing optimal solutions which is slow in the number of strings but fast (linear) in their total length. This algorithm is used for a number of viruses with relatively few genes. When the number of genes is larger, we compute approximate solutions using the greedy algorithm, which gives an upper bound for the optimal solution. We also give a lower bound for the shortest common superstring problem. The results obtained are then compared with what happens in nature. Remarkably, the compression obtained by viruses is quite high and also very close to the one achieved by modern computers.

According to [5], all virus genomes experience pressure to minimize their size. For example, those with prokaryotic hosts must be able to replicate quickly to keep up with their host cells. In the case of viruses with eukaryotic hosts, the pressure on the genome size comes from the small size of the virus, that is, from the amount of nucleic acid that can be incorporated. One way to reduce the size of their genome is by overlapping genes. Some viruses show tremendous compression of genetic information when compared with the low density of information in the genomes of eukaryotic cells. As claimed in [5], overlapping genes are common and "the maximum genetic capacity is compressed into the minimum genome size." This property looks very interesting from a mathematical point of view and we found it surprising that it has not been investigated much.
Daley and McQuillan [9] introduce and investigate a number of formal language theory operations motivated by this biological phenomenon. Krakauer [12] discusses genomic compression in general as achieved through reduced redundancy, overlapping genes, or translational coupling. In this paper, we investigate this property by naturally modelling it as the shortest common superstring problem (SCS). The genes are seen as strings and we look for the shortest superstring that contains them all. A variation is also considered due to the retrograde overlaps which may be present in some viruses. The SCS problem is known to be NP-hard. We give an algorithm to compute optimal solutions which works well when the number of strings is not too high. The algorithm is conceptually very simple and also very fast with respect to the total length of all strings. We used this algorithm for those viral genomes whose number of genes is not very high. When the number of strings increases, we are no longer able to find optimal solutions and use a greedy algorithm for an approximation. This gives an upper bound for the length of a shortest superstring and, for a better estimate, we also provide a lower bound. Finally, our results are compared with those obtained by viruses. The amount of compression using gene overlapping achieved by the viruses is remarkable; in all examples considered, it is the same or very close to the one obtained by modern computers. The biological significance of these results is to be investigated. Aside from the compression achieved in nature, any solution (or lower bound) for the corresponding SCS problem provides a limitation on the size of a viral genome which contains a given set of genes. Again, the biological relevance of such results remains to be clarified.

Let Σ be an alphabet, that is, a finite non-empty set. Such an alphabet can be the set of four nucleotides {A, T, C, G}. We denote by Σ* the set of all finite strings over Σ.
The empty word is denoted ε. Given a string w ∈ Σ*, w = a_1 a_2 ··· a_n, a_i ∈ Σ, the length of w is |w| = n; the length of ε is 0. We also denote w[i] = a_i and w[i..j] = a_i a_{i+1} ··· a_j, for all 1 ≤ i ≤ j ≤ n. The reversal of w is a_n a_{n−1} ··· a_1. If w = xyz, for some w, x, y, z ∈ Σ*, then x, y, and z are a prefix, factor (or substring), and suffix of w, resp. The prefix (suffix) of length n of w is denoted pref_n(w) (suff_n(w)). For further notions and results on string combinatorics and algorithms we refer to [14] and [7].

The formal definition of the shortest common superstring problem (SCS) is: given k strings w_1, w_2, ..., w_k, find a shortest string w which contains all w_i's as factors; such a w is usually called a shortest common superstring. Any superstring will be called a solution, whereas a shortest one is an optimal solution.

Example 1. Consider the strings w_1 = baac, w_2 = aacc, and w_3 = acaa. A shortest superstring has length 8; it is baacaacc.

The SCS problem has many applications. Data compression is one of the fields where the SCS problem is very useful because data may be stored very efficiently as a superstring; see [10], [15]. This superstring contains all the information in a compressed form. Computational biology is another field where SCS can be applied; see [13]. The SCS problem was proved to be NP-hard in [10] and then MAX SNP-hard in [3]. Therefore, it is unlikely to have polynomial time exact algorithms and research focussed mainly on approximation algorithms [17, 8, 11, 1, 2, 4]. The best approximation algorithm to date is due to Sweedyk [16] and achieves an approximation ratio of 2½. Still, in practice the very simple greedy algorithm is used with very good results. Blum et al. [3] proved that greedy is a 4-approximation algorithm. The still open conjecture is that the approximation factor is 2, which would be optimal as there are examples for which greedy produces no better approximations.
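For instances as small as the one in Example 1, the optimality claim can be checked by sheer enumeration: generate all strings over the alphabet by increasing length and stop at the first superstring found. This is only a sanity check of the definition, not one of the algorithms discussed in this paper; the function name is ours.

```python
from itertools import product

def shortest_superstring_exhaustive(strings, alphabet):
    """Return a shortest string over `alphabet` containing every string
    in `strings` as a factor. Exponential; only for tiny instances."""
    n = 1
    while True:
        for cand in product(alphabet, repeat=n):
            w = "".join(cand)
            if all(s in w for s in strings):
                return w
        n += 1

w = shortest_superstring_exhaustive(["baac", "aacc", "acaa"], "abc")
print(len(w))  # 8, confirming Example 1
```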
As already mentioned in the introduction, viruses can overlap their genes. There are several types of overlaps. First we need to recall DNA complementarity: the two strands of DNA are complementary and have opposite direction. The complementarity is such that whenever an A occurs on one strand, a T must appear on the other; we say that A and T are complementary. Similarly, C and G are complementary. We denote the complement of a nucleotide N by N̄. That is, we have Ā = T, C̄ = G, and vice versa. Also, the complement of the complement of a nucleotide is the nucleotide itself. Complementarity is needed to understand retrograde overlapping. For a string w = a_1 a_2 ··· a_{|w|}, we construct the complemented reversal of w, w̄ = ā_{|w|} ā_{|w|−1} ··· ā_1. When w appears in one strand, w̄ occurs opposite it in the other strand. In Fig. 1 we have overlaps on the same strand, that is, direct overlaps; one is called suffix overlap and the other prefix overlap, but such a difference is irrelevant for us. In Fig. 2 we have retrograde overlaps, where an x in the upper strand corresponds to an x̄ in the lower strand. Again, one is called head-on overlap, the other end-on overlap, without relevance for our purpose.

In order to give some algorithms for optimal or approximate solutions for the SCS problem, we need to compute overlaps between strings. Also, we need to eliminate those strings which are factors of others. An overlap between two given strings u and v is any suffix of u that is also a prefix of v. We shall need only the longest overlaps but our algorithm computes them all in the same optimal time. The set overlaps(u, v) contains the lengths of all suffixes of u that are prefixes of v. We denote by overlap(u, v) the length of the longest overlap. Here is an example: for the strings in Example 1, overlaps(w_1, w_3) = {0, 2} and overlap(w_1, w_3) = 2, the longest overlap being ac, a suffix of w_1 = baac and a prefix of w_3 = acaa.

To compute overlaps, we shall use a classical notion in pattern matching: a border of a string w is any string which is both a prefix and a suffix of w; the border of w, denoted border(w), is the longest non-trivial border of w, that is, different from w itself.
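These notions translate directly into code; a minimal Python sketch with names of our choosing (the paper computes overlaps more efficiently via borders, using only linear time):

```python
def revcomp(w):
    """Complemented reversal of a DNA string (w-bar in the text)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[a] for a in reversed(w))

def all_overlaps(u, v):
    """overlaps(u, v): lengths of all suffixes of u that are prefixes of v
    (the empty overlap 0 included). Naive quadratic version."""
    return {n for n in range(min(len(u), len(v)) + 1)
            if n == 0 or u[-n:] == v[:n]}

print(revcomp("AAGT"))                       # ACTT
print(sorted(all_overlaps("baac", "acaa")))  # [0, 2]
```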
Notice that all borders of w are: border(w), border²(w) = border(border(w)), border³(w), ..., ε. Denote |w| = n and consider the array border_w[0..n], where, for all 0 ≤ i ≤ n, border_w[i] is the length of the border of the prefix w[1..i]; this array can be computed in time O(n).

We use borders to solve our problem. Assume we are given two strings u and v. Consider a new letter # (which does not appear in u or v) and construct the string w = v#u. It is clear that any border of w gives an overlap of u and v and vice versa. Therefore, using borders, we obtain an algorithm for computing overlaps which is linear in terms of |u| + |v|. Notice, however, that if one of the strings is much longer than the other, then we do not need the whole long string but just a short piece of it, since an overlap cannot be longer than min(|u|, |v|). An algorithm which works in linear time in the size of the shorter string would simply consider, say, the string v#suff_{|v|}(u) when |v| ≤ |u|.

We can also do it all at once. For the SCS problem, we always exclude from calculations the strings which are included as factors in others. This is pattern searching and there are many linear time algorithms for it. We can also use the borders as above to give a simple algorithm, overlaps-and-factors(u, v), which both identifies factors and computes overlaps. We consider w = v#u. Assuming |v| ≤ |u|, v is a factor of u if and only if there is an i such that border_w[i] = |v|. The algorithm therefore computes the border array of w, reads overlaps(u, v) off the chain of borders of w, and returns a special value (say, −1) if it finds an i with border_w[i] = |v| and |v| ≤ |u|.

Lemma 1. The algorithm overlaps-and-factors(u, v) computes all overlaps of u and v and identifies factors in time O(|u| + |v|); this is optimal since it is the minimum required for searching.

We may assume that none of the strings w_i appears as a factor of another one. (We check this in the algorithm.) Therefore, for any solution w of SCS, there is a permutation σ on k elements such that w contains each w_i as a factor starting at position p_i and p_{σ(1)} < p_{σ(2)} < ··· < p_{σ(k)}.

Example 5. For the strings in Example 1, the optimal solution is given by the permutation (1, 3, 2).

Therefore, our brute-force algorithm to compute an optimal solution of SCS will try all such permutations σ; the set of all permutations on k elements is the symmetric group S_k.
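The border-array approach can be sketched as follows in Python (our names; border_array is the classical failure-function computation, and the overlaps are read off the chain of borders of v#u):

```python
def border_array(w):
    """b[i] = length of the longest proper border of w[:i] (failure function)."""
    b = [0] * (len(w) + 1)
    k = 0
    for i in range(1, len(w)):
        while k > 0 and w[i] != w[k]:
            k = b[k]
        if w[i] == w[k]:
            k += 1
        b[i + 1] = k
    return b

def overlaps_via_borders(u, v):
    """overlaps(u, v) from the borders of w = v#u; linear in |u| + |v|."""
    w = v + "#" + u
    b = border_array(w)
    res, k = {0}, b[len(w)]
    while k > 0:          # borders of w are border(w), border^2(w), ..., ε
        res.add(k)
        k = b[k]
    return res

def is_factor(v, u):
    """v is a factor of u iff border_w[i] = |v| for some i, where w = v#u."""
    w = v + "#" + u
    return len(v) <= len(u) and any(x == len(v) for x in border_array(w))

print(sorted(overlaps_via_borders("baac", "aacc")))  # [0, 3]
print(is_factor("aac", "baac"))                      # True
```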
For each permutation, we need the maximum overlap between w_{σ(i)} and w_{σ(i+1)}; no other overlaps are needed. Indeed, assume that w_{σ(i)} and w_{σ(i+1)} overlap each other on a length less than their maximal overlap. Then we can simply overlap them more to obtain a shorter superstring. We shall need one more definition. For two strings u and v which are not factors of each other, we denote by merge(u, v) the string obtained by overlapping them as much as possible, that is, merge(u, v) = u · suff_{|v|−overlap(u,v)}(v). Here is the algorithm.

scs-optimal(w_1, w_2, ..., w_k)
1.  for i from 1 to k do
2.    for j from 1 to k do
3.      if i ≠ j then
4.        overlap(w_i, w_j) ← overlaps-and-factors(w_i, w_j)
5.        if overlap(w_i, w_j) = −1 then eliminate w_i
6.  scs ← Σ_{i=1}^{k} |w_i|   [we use the same k but it may be smaller]
7.  for all σ ∈ S_k do
8.    w ← w_{σ(1)}
9.    for i from 2 to k do
10.     w ← merge(w, w_{σ(i)})
11.   if scs > |w| then
12.     scs ← |w|
13. return scs

Proposition 1. The algorithm scs-optimal(w_1, w_2, ..., w_k) computes an optimal solution for SCS and runs in time O(k! · Σ_{i=1}^{k} |w_i|).

Proof. The correctness follows from the fact that we try all permutations. As explained above, after eliminating strings which appear as factors of others, it is enough to consider only longest overlaps. The time complexity for the preprocessing steps 1-5 is O(k · Σ_{i=1}^{k} |w_i|), because of Lemma 1. In the main processing part, steps 7-12, we repeat k! times something linear in Σ_{i=1}^{k} |w_i|. This is the dominant order.

As the SCS problem is NP-hard, in practice approximation algorithms are often used to find a superstring which may not be shortest but is hopefully close to optimal. The most common such algorithm for SCS is the greedy algorithm, which we describe below. It uses the natural idea of considering the longer overlaps first. It may not produce an optimal solution but it cannot be too far away. Here is an example where the greedy algorithm does not give an optimal solution.

Example 7. Consider again the strings in Example 1, w_1 = baac, w_2 = aacc, and w_3 = acaa.
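The algorithm translates almost line by line into Python; a sketch with names of our choosing, using a naive quadratic overlap computation instead of overlaps-and-factors for brevity:

```python
from itertools import permutations

def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    return max(n for n in range(min(len(u), len(v)) + 1)
               if n == 0 or u[-n:] == v[:n])

def merge(u, v):
    """Overlap u and v as much as possible: u . suff_{|v|-overlap(u,v)}(v)."""
    return u + v[overlap(u, v):]

def scs_optimal(strings):
    # steps 1-5: eliminate strings that are factors of others
    ws = [u for i, u in enumerate(strings)
          if not any(i != j and u in v for j, v in enumerate(strings))]
    best = sum(len(w) for w in ws)
    for sigma in permutations(ws):       # steps 7-12: try all permutations
        w = sigma[0]
        for s in sigma[1:]:
            w = merge(w, s)
        best = min(best, len(w))
    return best

print(scs_optimal(["baac", "aacc", "acaa"]))  # 8
```

The factor elimination above assumes the input strings are pairwise distinct; duplicates would eliminate each other.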
The overlaps are as follows: overlap(w_1, w_2) = 3, overlap(w_1, w_3) = 2, overlap(w_3, w_2) = 2, and all the others are empty. The greedy algorithm chooses first the longest overlap, that is, overlap(w_1, w_2), and obtains the string baaccacaa of length 9, since merge(w_1, w_2) and w_3 have no overlap. But there is a shorter one, given by the permutation (1, 3, 2), of length 8, that is, baacaacc. It is conjectured that the greedy solution is always at most twice as long as the optimal one; see [16] and the references therein for approximation algorithms for the SCS problem. In practice, the greedy algorithm works pretty well, as we shall see also in our experiments.

scs-greedy(w_1, w_2, ..., w_k)
1.  compute overlaps and eliminate factors as before
2.  greedy_scs ← Σ_{i=1}^{k} |w_i|
3.  for all (i, j) with overlap(w_i, w_j) = max_{(s,t)} overlap(w_s, w_t) do
4.    eliminate w_i and w_j from the list
5.    add w = merge(w_i, w_j) to the list
6.    denote the new list w'_1, ..., w'_{k−1}
7.      [the overlaps of w are given by w_i for prefix and by w_j for suffix]
8.    ℓ ← scs-greedy(w'_1, w'_2, ..., w'_{k−1})
9.    if greedy_scs > ℓ then
10.     greedy_scs ← ℓ
11. return greedy_scs

The greedy algorithm gives an upper bound for the shortest length of a common superstring. We give in this section a lower bound for the length of the shortest superstring. It is computed using also a greedy approach but without checking whether it is possible to actually find a superstring which uses the considered overlaps. (When this is possible, we have an optimal solution of SCS.) Any superstring w is defined by a permutation σ on k elements which gives k − 1 overlaps. Also, the length of the superstring is the total length of all strings minus the total length of overlaps, that is, |w| = Σ_{i=1}^{k} |w_i| − Σ_{i=1}^{k−1} overlap(w_{σ(i)}, w_{σ(i+1)}). For our estimate, we consider the matrix of overlaps, (overlap(w_i, w_j))_{1≤i≠j≤k}. A permutation σ as above gives k − 1 overlaps such that no two are in the same row or column. We relax this condition by considering only rows or only columns. Choosing k − 1 longest overlaps such that no two are on the same row gives a lower bound.
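A Python sketch of the greedy idea (our names). Unlike scs-greedy above, which branches over all pairs achieving the maximal overlap, this version follows a single maximal pair at each step, which is how greedy is typically run in practice:

```python
def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    return max(n for n in range(min(len(u), len(v)) + 1)
               if n == 0 or u[-n:] == v[:n])

def scs_greedy(strings):
    """Repeatedly merge one pair with the longest overlap; returns the
    length of the resulting superstring (an upper bound for SCS)."""
    ws = [u for i, u in enumerate(strings)
          if not any(i != j and u in v for j, v in enumerate(strings))]
    while len(ws) > 1:
        # pick a pair (i, j) maximizing the overlap
        k, i, j = max((overlap(u, v), i, j)
                      for i, u in enumerate(ws)
                      for j, v in enumerate(ws) if i != j)
        merged = ws[i] + ws[j][k:]
        ws = [w for t, w in enumerate(ws) if t not in (i, j)] + [merged]
    return len(ws[0])

print(scs_greedy(["baac", "aacc", "acaa"]))  # 9, as in Example 7
```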
Similarly for columns. The algorithm below computes the first one; the second is computed analogously. We assume the matrix of overlaps has already been computed.

lower-bound-row(w_1, w_2, ..., w_k)
1.  sort all elements of the matrix (overlap(w_i, w_j))_{1≤i≠j≤k} decreasingly
2.    to obtain overlap(w_{i_1}, w_{j_1}), ..., overlap(w_{i_{k²−k}}, w_{j_{k²−k}})
3.  lower_bound_row ← 0
4.  rows_used ← 0
5.  t ← 1
6.  while rows_used < k − 1 do
7.    if row i_t is not used then
8.      lower_bound_row ← lower_bound_row + |w_{i_t}| − overlap(w_{i_t}, w_{j_t})
9.      mark row i_t as used
10.     rows_used ← rows_used + 1
11.   t ← t + 1
12. lower_bound_row ← lower_bound_row + |w_{j_{t−1}}|
13. return lower_bound_row

For correctness, it is enough to prove that the sum of the overlaps chosen by the algorithm is at least as large as the sum of overlaps corresponding to an optimal solution. In both cases, there are k − 1 overlaps involved, no two in the same row. Assume that an optimal solution chooses all rows except for the ith whereas our algorithm for the lower bound misses only the jth row. In all rows chosen by both, the overlap included for the lower bound is at least as large. If i = j, this proves that we obtain indeed a lower bound. If i ≠ j, then the overlap chosen for the lower bound from row i is at least as large as the one chosen by the optimal solution in row j: otherwise the latter would appear strictly earlier in the sorted list from step 2 and the algorithm would have chosen it, as row j is never used. As already mentioned, another lower bound is obtained similarly, by choosing k − 1 elements from different columns in the overlap matrix; denote this lower bound by lower_bound_col. We have then the following lower bound: lower_bound_scs = max(lower_bound_row, lower_bound_col). The next result, which summarizes the above discussed bounds, is clear.

Proposition 2. For any instance of SCS, lower_bound_scs ≤ scs ≤ greedy_scs.

Example 8. For the strings in Example 1, we obtain:
lower_bound_row = 7, because of overlap(w_1, w_2) and overlap(w_3, w_2),
lower_bound_col = 7, because of overlap(w_1, w_2) and overlap(w_1, w_3),
lower_bound_scs = 7, scs = 8, greedy_scs = 9.
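The row-based lower bound in Python (our names; the column version would swap the roles of rows and columns):

```python
def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    return max(n for n in range(min(len(u), len(v)) + 1)
               if n == 0 or u[-n:] == v[:n])

def lower_bound_row(ws):
    """Pick k-1 largest overlaps, no two from the same row, without
    checking that a superstring realizing them exists."""
    k = len(ws)
    entries = sorted(((overlap(ws[i], ws[j]), i, j)
                      for i in range(k) for j in range(k) if i != j),
                     key=lambda e: -e[0])          # step 1: sort decreasingly
    bound, used, t = 0, set(), 0
    while len(used) < k - 1:                       # steps 6-11
        ov, i, j = entries[t]
        if i not in used:
            bound += len(ws[i]) - ov
            used.add(i)
        t += 1
    return bound + len(ws[entries[t - 1][2]])      # step 12: add |w_{j_{t-1}}|

print(lower_bound_row(["baac", "aacc", "acaa"]))  # 7, as in the example
```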
The lower bound cannot be achieved, however, as it involves the beginning of w_2 (or the end of w_1) twice. Also, it happened that the lower bounds corresponding to rows and columns are the same; this is not true in general.

The possibility of retrograde overlaps (see Fig. 2) further complicates the search for solutions, optimal or approximate. Each string may appear in a superstring as it is or as its complemented reversal. Therefore, we need first to compute more overlaps. The following equalities help computing only half of all possible ones: overlap(u, v) = overlap(v̄, ū) and overlap(u, v̄) = overlap(v, ū). For the exact algorithm, we need to consider, for each string w_i, whether w_i or w̄_i appears at position p_{σ(i)}, which makes the algorithm even slower in the number of strings. The greedy algorithm works rather similarly; only the overlaps for the merged strings need to be set a bit differently. For instance, if the overlap between w_i and w_j is chosen, then the string merge(w_i, w_j) is added and its overlaps are taken from those given by the prefixes of w_i and the suffixes of w_j. The lower bound is computed similarly. When choosing a certain overlap, the proper rows or columns need to be discarded from further consideration. For instance, in the case of lower_bound_row, if the overlap between w_i and w_j is chosen, then all overlaps involving the suffix of w_i must be discarded, that is, all pairs (w_i, w_s), (w_i, w̄_s), (w_s, w̄_i), and (w̄_s, w̄_i).

We show in this section our computations for a number of viral genomes which were obtained from the National Center for Biotechnology Information (web site www.ncbi.nlm.nih.gov). We start with a set of strings which are the genes and try to find a short superstring. Then we compare our result with the one achieved by the viruses. Notice that the time complexity of our exact algorithm grows very fast with the number of genes, but is linear in the total length.
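The symmetry used here is that an overlap of u over v reappears, complemented and reversed as a whole, as an overlap of v̄ over ū. A quick Python check (the example strings are ours):

```python
def revcomp(w):
    """Complemented reversal of a DNA string (w-bar in the text)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[a] for a in reversed(w))

def overlap(u, v):
    """Length of the longest suffix of u that is a prefix of v."""
    return max(n for n in range(min(len(u), len(v)) + 1)
               if n == 0 or u[-n:] == v[:n])

u, v = "AAGT", "GTCC"
# the suffix GT of u matches the prefix GT of v; on the other strand it
# becomes the suffix AC of revcomp(v) matching the prefix AC of revcomp(u)
print(overlap(u, v))                    # 2
print(overlap(revcomp(v), revcomp(u)))  # 2 as well
```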
We managed to obtain exact solutions in Table 1 for a number of single-stranded RNA viral genomes with relatively few genes. The columns give, in order, the family, the name of the virus, the total length of all genes, the compression achieved by the virus (total length of coding regions), and the shortest common superstring. All lengths are given in number of nucleotides. For genomes with more genes, we had to use the approximation algorithms. The results for a number of double-stranded DNA viral genomes are shown in Table 2. The columns have similar meaning, except that the one for the shortest common superstring is replaced by two: greedy and lower bound. All lengths are given in number of base pairs. The compression achieved by the viruses is, on average, 7.98%, that is, the (average) ratio between the reduction in size (total length of all genes minus viral coding) and the initial size (total length of genes). For the viruses in the first table, the ratio is higher, 11.95%, whereas for the second table it is 3.36%. The average compression ratio is remarkably high if we keep in mind that DNA molecules (seen as strings) are very difficult to compress in general. Commercial file-compression programs usually achieve no compression at all and the best specially designed algorithms, see [6], can achieve something like 13.73% (that is the average for DNACompress from [6], the best such algorithm to date). Also, the compression achieved by viruses is very close to what we can do (using overlapping only) by computer. The above averages, for all viruses considered, single-stranded RNA, and double-stranded DNA viruses are 8.11% (only 0.13% better than viruses), 11.99%, and 3.59%, resp. For the second table we used the greedy compression; it should also be noticed that our lower bound behaves pretty well. To give a better idea of the overlaps, Figs.
3-14 at the end show all genomes considered above as they appear in nature with the non-coding regions removed (top) and then as computed by our programs (bottom). The overlaps and the different strands are shown in different colors. (The figures are most useful in the electronic version of the paper.)

References
Improved length bounds for the shortest superstring problem
A 2⅔-approximation algorithm for the shortest superstring problem
Linear approximation of shortest superstrings
Rotations of periodic strings and short superstrings
Principles of Molecular Virology
DNACompress: fast and effective DNA sequence compression
Parallel and sequential approximations of shortest superstrings
Viral gene compression: complexity and verification
On finding minimal length superstrings
Long tours and short superstrings
Evolutionary principles of genomic compression
Introduction to Bioinformatics
Algebraic Combinatorics on Words
A 2½-approximation algorithm for shortest superstring
Approximating shortest superstrings