key: cord-0042972-8ht5zuzn
authors: Oliveira, Andre Rodrigues; Jean, Géraldine; Fertin, Guillaume; Brito, Klairton Lima; Dias, Ulisses; Dias, Zanoni
title: A 3.5-Approximation Algorithm for Sorting by Intergenic Transpositions
date: 2020-02-01
journal: Algorithms for Computational Biology
DOI: 10.1007/978-3-030-42266-0_2
sha: 06749abc643cb7bd0e304dfaeec38a697a7f2588
doc_id: 42972
cord_uid: 8ht5zuzn

Genome rearrangements affect large stretches of genomes during evolution. One of the most studied genome rearrangements is the transposition, which occurs when a sequence of genes is moved to another position inside the genome. Mathematical models have been used to estimate the evolutionary distance between two different genomes based on genome rearrangements. However, many of these models have focused only on the (order of the) genes of a genome, disregarding other important elements in it. Recently, researchers have shown that considering the regions between each pair of genes, called intergenic regions, can enhance the distance estimation on realistic data. In this work, we study the transposition distance between two genomes while also considering intergenic regions, a problem we name Sorting Permutations by Intergenic Transpositions (SbIT). We show that this problem is NP-hard and propose a 3.5-approximation algorithm for it.

Genome rearrangements are events that modify genomes by inserting or removing large stretches of DNA sequences, or by changing the order and the orientation of genes inside genomes. A transposition [1] is a rearrangement that swaps the positions of two adjacent sequences of genes inside a genome. Another example of a genome rearrangement is the reversal [11], which reverses the order and the orientation of a sequence of genes.

We compute the rearrangement distance between two genomes by determining the minimum number of events that transform one into another. A model M is a set of genome rearrangements that can be used to calculate the rearrangement distance.

Algorithms based on the rearrangement distance perform whole-genome comparison and may be used as a tool to infer phylogenetic relationships. The usual method fills a matrix of pairwise distances among genomes that is later used to generate phylogenetic trees [2, 12, 14]. As with "classical" rearrangements, having a wide spectrum of models helps to better understand how genomes evolve.

Although this is rarely the case in practice, if genomes contain no repeated genes and share the same set of n genes, they can be represented as permutations. Without loss of generality, we consider that one of these genomes is the identity permutation, i.e., the sorted permutation $\iota = (1\ 2 \ldots n)$.

The Sorting by Rearrangements Problem thus consists in determining the shortest sequence of events from M that sorts a permutation $\pi$, i.e., that transforms it into $\iota$. Sorting by Rearrangements has been extensively studied in the past. For instance, Sorting by Transpositions has been proved NP-hard [6], while the best algorithm so far has an approximation factor of 1.375 [8].

Representing genomes through their gene order (thus, by permutations) implies that information not contained directly in the genes is lost. In particular, intergenic regions, i.e., the DNA sequences between the genes, are not considered. Recently, some authors argued that incorporating intergenic region sizes in the models changes the distance estimations, and actually improves them [3, 4]. It therefore seems worth investigating models that consider both gene order and intergenic sizes.

Results considering intergenic sizes for models with Double-Cut and Join (DCJ) and DCJs along with indels (i.e., insertions and deletions) are known: the former is NP-hard and has a 4/3-approximation algorithm [9], while the latter is polynomial [7]. In addition to the approximation algorithm, the authors in [9] also developed two exact algorithms: a fixed-parameter tractable algorithm and an integer linear programming formulation. Practical tests from [7] showed that statistical properties of the inferred scenarios using intergenic regions are closer to the true ones than scenarios which do not use them.

Some results considering intergenic sizes with super short operations (i.e., a reversal or a transposition applied to one or two genes of the genome) are known [13]. Using the concept of breakpoints, Brito et al. showed a 4-approximation algorithm (resp. 6-approximation algorithm) for sorting by reversals (resp. reversals and transpositions) when also considering intergenic regions on unsigned permutations [5]. They also showed that both problems are NP-hard.

In this paper, we investigate the transposition distance between genomes that also takes into account intergenic regions, a problem we name Sorting by Intergenic Transpositions (SbIT). Instead of using breakpoints, here we propose a modification of a known graph structure to represent both gene order and intergenic sizes in a single graph. We show that SbIT is NP-hard, and, with the help of this adapted graph structure, we design a 3.5-approximation algorithm.

This work is organized as follows. Section 2 presents some definitions we use extensively throughout this paper. Section 3 presents the graph structure we use to produce our approximation algorithm. Section 4 contains a series of intermediate lemmas that support our algorithm. Section 5 describes the 3.5-approximation algorithm for SbIT. Section 6 concludes the paper.

A genome G is a sequence of n genes denoted by $g_i$, with $i \in [1..n]$, in which two consecutive genes $g_{j-1}$ and $g_j$, with $j \in [2..n]$, are separated by a non-coding region called an intergenic region, denoted by $r_j$. Intergenic regions are also present at the extremities of the genome ($r_1$ and $r_{n+1}$): $G = r_1, g_1, r_2, g_2, \ldots, r_n, g_n, r_{n+1}$.

Given two genomes of closely related species, and as a simplification for our initial analysis, we expect them to share the same set of genes, which may appear in different orders due to genome rearrangements. Selective pressures tend to conserve genes and not intergenic regions [3]. Therefore, genome rearrangements rarely cut inside genes, whereas cuts do appear in intergenic regions.

Our model assumes that (i) no gene is duplicated in a genome and (ii) both genomes share the same set of genes. We assign unique integer numbers in the range $[1..n]$ to each gene and represent them as a permutation. Therefore, the sequence of genes in a genome is modeled by a permutation $\pi = (\pi_1\ \pi_2 \ldots \pi_n)$, $\pi_i \in \mathbb{N}$, $1 \le \pi_i \le n$, and $\pi_i \ne \pi_j$ for all $i \ne j$.

We represent intergenic regions by their lengths instead of assigning unique identifiers to each of them, which would be pointless because rearrangements may split intergenic regions several times. The sequence of intergenic regions around n genes is represented as $\breve{\pi} = (\breve{\pi}_1\ \breve{\pi}_2 \ldots \breve{\pi}_{n+1})$, $\breve{\pi}_i \in \mathbb{N}$. Intergenic region $\breve{\pi}_i$ is on the left side of $\pi_i$, whereas $\breve{\pi}_{i+1}$ is on the right side.

Our goal is to compute the distance between two genomes: $(\pi, \breve{\pi})$ and $(\sigma, \breve{\sigma})$. We may assign unique labels to genes arbitrarily, so we simplify the definition of our problem by setting $\sigma$ as the identity permutation $\iota$, such that $\sigma = \iota = (1\ 2 \ldots n)$ and $\breve{\sigma} = \breve{\iota}$, which describes all the information we need for our problem. Therefore, an instance of our problem is composed of three elements $(\pi, \breve{\pi}, \breve{\iota})$, such that $\sum_{i=1}^{n+1} \breve{\pi}_i = \sum_{i=1}^{n+1} \breve{\iota}_i$, which guarantees that the total length of the intergenic regions is conserved.
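The conditions on an instance can be checked mechanically. The following is a minimal sketch, assuming genomes are given as plain Python lists; the function name `is_valid_instance` and the argument names are hypothetical and not part of the original work.

```python
from typing import Sequence


def is_valid_instance(pi: Sequence[int],
                      pi_breve: Sequence[int],
                      iota_breve: Sequence[int]) -> bool:
    """Check the model assumptions for an instance (pi, pi_breve, iota_breve)."""
    n = len(pi)
    # pi must be a permutation of 1..n: no duplicated gene, same gene set as iota.
    if sorted(pi) != list(range(1, n + 1)):
        return False
    # n genes are surrounded by exactly n + 1 intergenic regions.
    if len(pi_breve) != n + 1 or len(iota_breve) != n + 1:
        return False
    # Intergenic lengths are natural numbers (zero allowed).
    if any(v < 0 for v in list(pi_breve) + list(iota_breve)):
        return False
    # Transpositions only move nucleotides around, so totals must agree.
    return sum(pi_breve) == sum(iota_breve)


# Instance taken from the caption of Fig. 1.
pi = [3, 2, 1, 7, 6, 4, 8, 5]
pi_breve = [1, 7, 0, 4, 2, 7, 4, 0, 9]
iota_breve = [3, 5, 1, 5, 6, 3, 7, 2, 2]
print(is_valid_instance(pi, pi_breve, iota_breve))
```

For the Fig. 1 instance both intergenic sequences sum to 34 nucleotides, so the check passes.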

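An intergenic transposition is an operation $\rho^{(i,j,k)}_{(x,y,z)}$, with $1 \le i < j < k \le n+1$, $0 \le x \le \breve{\pi}_i$, $0 \le y \le \breve{\pi}_j$, $0 \le z \le \breve{\pi}_k$, and $\{x, y, z\} \subset \mathbb{N}$. An intergenic transposition acts on instances to generate new ones: $(\pi, \breve{\pi}, \breve{\iota}) \cdot \rho^{(i,j,k)}_{(x,y,z)} = (\pi', \breve{\pi}', \breve{\iota})$, where (i) $\pi' = (\pi_1\ \pi_2 \ldots \pi_{i-1}\ \pi_j\ \pi_{j+1} \ldots \pi_{k-1}\ \pi_i\ \pi_{i+1} \ldots \pi_{j-1}\ \pi_k\ \pi_{k+1} \ldots \pi_n)$, and (ii) $\breve{\pi}' = (\breve{\pi}_1 \ldots \breve{\pi}_{i-1}\ \breve{\pi}'_i\ \breve{\pi}_{j+1} \ldots \breve{\pi}_{k-1}\ \breve{\pi}'_j\ \breve{\pi}_{i+1} \ldots \breve{\pi}_{j-1}\ \breve{\pi}'_k\ \breve{\pi}_{k+1} \ldots \breve{\pi}_{n+1})$, with $\breve{\pi}'_i = x + (\breve{\pi}_j - y)$, $\breve{\pi}'_j = z + (\breve{\pi}_i - x)$, and $\breve{\pi}'_k = y + (\breve{\pi}_k - z)$.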

As we can see, while $\rho^{(i,j,k)}_{(x,y,z)}$ keeps $\breve{\iota}$ intact, it moves segments from $\pi$ and $\breve{\pi}$ to other positions and also modifies the contents of three elements from $\breve{\pi}$: it cuts $\breve{\pi}_i$ after the first x nucleotides, $\breve{\pi}_j$ after the first y nucleotides, and $\breve{\pi}_k$ after the first z nucleotides, and rearranges them as defined above. Figure 1 shows examples of instances and the application of an intergenic transposition.
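The effect of one intergenic transposition can be sketched in a few lines. This is an illustrative sketch only: the function name `apply_transposition` is hypothetical, and the update of the three cut intergenic regions is inferred from the cutting description above (each new region joins the left part of one cut with the right part of another) and checked against the example in the caption of Fig. 1.

```python
def apply_transposition(pi, pi_breve, i, j, k, x, y, z):
    """Apply rho^(i,j,k)_(x,y,z) to (pi, pi_breve); indices are 1-based as in the text."""
    pi, pi_breve = list(pi), list(pi_breve)
    n = len(pi)
    if not 1 <= i < j < k <= n + 1:
        raise ValueError("need 1 <= i < j < k <= n + 1")
    if not (0 <= x <= pi_breve[i - 1] and 0 <= y <= pi_breve[j - 1]
            and 0 <= z <= pi_breve[k - 1]):
        raise ValueError("each cut must fall inside its intergenic region")

    # Genes: swap the adjacent blocks pi_i..pi_{j-1} and pi_j..pi_{k-1}.
    new_pi = pi[:i - 1] + pi[j - 1:k - 1] + pi[i - 1:j - 1] + pi[k - 1:]

    # Right-hand remainders of the three cut regions.
    x_rest = pi_breve[i - 1] - x
    y_rest = pi_breve[j - 1] - y
    z_rest = pi_breve[k - 1] - z

    # Intergenic regions: untouched regions move with their blocks,
    # and the three cut regions are re-glued.
    new_breve = (pi_breve[:i - 1] + [x + y_rest]
                 + pi_breve[j:k - 1] + [z + x_rest]
                 + pi_breve[i:j - 1] + [y + z_rest]
                 + pi_breve[k:])
    return new_pi, new_breve


# Example from the caption of Fig. 1: rho^(2,5,6)_(2,1,2).
pi = [3, 2, 1, 7, 6, 4, 8, 5]
pi_breve = [1, 7, 0, 4, 2, 7, 4, 0, 9]
print(apply_transposition(pi, pi_breve, 2, 5, 6, 2, 1, 2))
# -> ([3, 6, 2, 1, 7, 4, 8, 5], [1, 3, 7, 0, 4, 6, 4, 0, 9])
```

The output reproduces genome G3 of Fig. 1, and the total intergenic length (34) is unchanged, as expected from an operation that only moves nucleotides.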

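The intergenic transposition distance $d_t(\pi, \breve{\pi}, \breve{\iota})$ is the minimum number m of intergenic transpositions $\rho_1, \ldots, \rho_m$ that transform $\pi$ into $\iota$, and $\breve{\pi}$ into $\breve{\iota}$. Therefore, $d_t(\pi, \breve{\pi}, \breve{\iota}) = m$ implies a minimal sequence $(\pi, \breve{\pi}, \breve{\iota}) \cdot \rho_1 \cdot \ldots \cdot \rho_m = (\iota, \breve{\iota}, \breve{\iota})$.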

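Lemma 1. SbIT is NP-hard.

Proof. The Sorting by Transpositions problem (SbT) has already been proved NP-hard [6]. An instance of this problem consists of a permutation $\gamma$ and a non-negative integer d. The goal is to determine whether it is possible to transform $\gamma$ into $\iota$ by applying at most d transpositions.

We can reduce all instances of SbT to instances of SbIT by setting $\pi = \gamma$ and $\breve{\pi} = \breve{\iota} = (0\ 0 \ldots 0)$. Since every intergenic region has length zero, every cut is forced to $x = y = z = 0$, so intergenic transpositions behave exactly like transpositions on $\gamma$. It is therefore possible to transform $\gamma$ into $\iota$ by applying at most d transpositions if and only if $d_t(\pi, \breve{\pi}, \breve{\iota}) \le d$.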
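The reduction itself is a constant-time wrapping of the SbT instance. A minimal sketch, assuming list-based permutations; the helper name `sbt_to_sbit` is hypothetical.

```python
def sbt_to_sbit(gamma):
    """Map an SbT permutation gamma to an SbIT instance (pi, pi_breve, iota_breve).

    All n + 1 intergenic regions get length zero, so the only admissible
    cut positions are x = y = z = 0.
    """
    n = len(gamma)
    zeros = [0] * (n + 1)
    return list(gamma), list(zeros), list(zeros)


print(sbt_to_sbit([3, 1, 2]))
# -> ([3, 1, 2], [0, 0, 0, 0], [0, 0, 0, 0])
```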

From now on, we will refer to intergenic transposition as transposition only.

We adapted a graph structure called the breakpoint graph [1, 10] to conveniently represent an instance $(\pi, \breve{\pi}, \breve{\iota})$ in a single graph. This structure allows us to describe algorithms and prove approximation bounds. All definitions we propose here are exemplified in Figs. 2 and 3.

We represent a given instance by a weighted cycle graph $G(\pi, \breve{\pi}, \breve{\iota}) = (V, E, w)$, where $V = \{+\pi_0, -\pi_1, +\pi_1, \ldots, -\pi_n, +\pi_n, -\pi_{n+1}\}$ is the set of vertices, $E$ is the set of edges, each of which is either gray or black, and $w : E \to \mathbb{N}$ is a function mapping edges to values corresponding to intergenic region lengths. The black edge set is $\{e_i = (-\pi_i, +\pi_{i-1}) : 1 \le i \le n+1\}$, and $w(e_i) = \breve{\pi}_i$. The gray edge set is $\{e'_i = (+(i-1), -i) : 1 \le i \le n+1\}$, and $w(e'_i) = \breve{\iota}_i$. In this definition, we consider $\pi_0 = 0$ and $\pi_{n+1} = n+1$.

The graph can be drawn in many arbitrary ways, but it is more convenient to place its vertices on a horizontal line in the same order as the elements of $\pi$, so that $+\pi_0$ (resp. $-\pi_{n+1}$) is the leftmost (resp. rightmost) element. In addition, for each element $\pi_i \in \pi$, vertex $-\pi_i \in G(\pi, \breve{\pi}, \breve{\iota})$ is drawn to the left of vertex $+\pi_i$. Since black edges relate to $\pi$, they are drawn as horizontal lines, and we label the black edge $e_i$ as i. Gray edges are drawn as arcs.

Fig. 1. Two (fictitious) genomes G1 and G2 that share 8 genes. We represent G1 as the identity permutation, which leads to G2 as the permutation $\pi = (3\ 2\ 1\ 7\ 6\ 4\ 8\ 5)$. We assume that the numbers of nucleotides between genes are good estimators for intergenic region lengths. For example, in G1 we have 3 nucleotides before "Gene 1", 5 between "Gene 1" and "Gene 2", and so on. Thus, $(\pi, \breve{\pi}, \breve{\iota})$ is such that $\pi = (3\ 2\ 1\ 7\ 6\ 4\ 8\ 5)$, $\breve{\pi} = (1\ 7\ 0\ 4\ 2\ 7\ 4\ 0\ 9)$, and $\breve{\iota} = (3\ 5\ 1\ 5\ 6\ 3\ 7\ 2\ 2)$. Genome G3 represents $(\pi', \breve{\pi}', \breve{\iota}) = (\pi, \breve{\pi}, \breve{\iota}) \cdot \rho^{(2,5,6)}_{(2,1,2)}$, so $\pi' = (3\ 6\ 2\ 1\ 7\ 4\ 8\ 5)$ and $\breve{\pi}' = (1\ 3\ 7\ 0\ 4\ 6\ 4\ 0\ 9)$.

Each vertex in $G(\pi, \breve{\pi}, \breve{\iota})$ has a gray edge and a black edge, which allows a unique decomposition of edges into cycles of alternating colors. Each cycle C with $\ell$ black edges is represented as a list $(c_1, c_2, \ldots, c_\ell)$ of the labels of its black edges. To make the notation unique, we assume $c_1$ to be the index of the "rightmost" black edge (i.e., the black edge with the highest label using our default drawing), and we traverse it from right to left. Several definitions regarding cycles follow.
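The decomposition into alternating cycles can be computed directly from the edge definitions. The sketch below is illustrative: the function name `cycle_decomposition` is hypothetical, and for each cycle it reports the total gray weight minus the total black weight, taking a cycle to be balanced exactly when this difference is zero (an assumption consistent with how unbalanced cycles are treated in the proofs of this paper).

```python
def cycle_decomposition(pi, pi_breve, iota_breve):
    """Return [(labels, gray_minus_black), ...] for the weighted cycle graph.

    labels lists the black-edge labels (c_1, ..., c_l) of a cycle, starting
    at its highest label and traversing each black edge from right to left.
    """
    n = len(pi)
    ext = [0] + list(pi) + [n + 1]            # pi_0 = 0 and pi_{n+1} = n + 1
    pos = {gene: idx for idx, gene in enumerate(ext)}
    seen, cycles = set(), []
    for start in range(n + 1, 0, -1):         # highest label first
        if start in seen:
            continue
        labels, black, gray = [], 0, 0
        i = start
        while i not in seen:
            seen.add(i)
            labels.append(i)
            black += pi_breve[i - 1]          # w(e_i) = pi_breve_i
            v = ext[i - 1]                    # black edge e_i ends at +pi_{i-1}
            gray += iota_breve[v]             # gray edge (+v, -(v+1)) weighs iota_breve_{v+1}
            i = pos[v + 1]                    # next black edge starts at -(v+1)
        cycles.append((tuple(labels), gray - black))
    return cycles


# First instance from the caption of Fig. 4.
pi = [3, 2, 1, 6, 5, 4, 7]
pi_breve = [2, 6, 5, 4, 1, 4, 3, 0]
iota_breve = [4, 1, 3, 5, 2, 3, 6, 1]
print(cycle_decomposition(pi, pi_breve, iota_breve))
# -> [((8,), 1), ((7, 5), 4), ((6, 4, 2), -5), ((3, 1), 0)]
```

The output contains the cycles A = (6, 4, 2), B = (7, 5), and H = (8) discussed with Fig. 4(a), and the differences over all cycles sum to zero because total intergenic length is conserved.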

A cycle is long if it has 3 or more black edges; a cycle is short if it has 2 black edges; a cycle is trivial if it has 1 black edge; a cycle is non-trivial if it is either short or long. A non-trivial cycle $C = (c_1, \ldots, c_\ell)$ is non-oriented if $c_1, \ldots, c_\ell$ is a decreasing sequence (note that every short cycle C is non-oriented); C is oriented otherwise.
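These definitions translate into a short classifier. A minimal sketch, assuming a cycle is given as its list of black-edge labels $(c_1, \ldots, c_\ell)$ with $c_1$ the highest label; the function name `classify_cycle` is hypothetical.

```python
def classify_cycle(labels):
    """Return (size, oriented) for a cycle given by its black-edge labels."""
    labels = list(labels)
    if len(labels) == 1:
        size = "trivial"
    elif len(labels) == 2:
        size = "short"
    else:
        size = "long"
    # Non-oriented means the labels form a decreasing sequence; orientation
    # is only defined for non-trivial cycles, so trivial cycles report False.
    decreasing = all(a > b for a, b in zip(labels, labels[1:]))
    oriented = len(labels) >= 2 and not decreasing
    return size, oriented


for cycle in [(6, 4, 2), (7, 5), (8,), (7, 5, 3, 6), (9, 7, 8)]:
    print(cycle, classify_cycle(cycle))
```

Cycles (7, 5, 3, 6) and (9, 7, 8), which appear in the discussion of Fig. 4, are reported as long and oriented, while (6, 4, 2) is long and non-oriented.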

Given a non-trivial cycle $C = (c_1, \ldots, c_\ell)$, every pair of black edges $e_{c_i}$ and …

Figure 3 shows an example of a weighted cycle graph and the application of an intergenic transposition on it, as well as the weighted cycle graph for an instance $(\iota, \breve{\iota}, \breve{\iota})$.

Let $c(\pi, \breve{\pi}, \breve{\iota})$, $c_b(\pi, \breve{\pi}, \breve{\iota})$, and $c_u(\pi, \breve{\pi}, \breve{\iota})$ denote the number of cycles, balanced cycles, and unbalanced cycles in $G(\pi, \breve{\pi}, \breve{\iota})$, respectively.

Lemma 2. The instance $(\iota, \breve{\iota}, \breve{\iota})$ has two properties that do not occur together in any other instance: (i) $c(\iota, \breve{\iota}, \breve{\iota}) = n + 1$, and (ii) $c_b(\iota, \breve{\iota}, \breve{\iota}) = n + 1$. As a consequence, we have that $c_u(\iota, \breve{\iota}, \breve{\iota}) = 0$ (see Fig. 3(c) for an example).

Note that Sorting by Intergenic Transpositions is more complicated than Sorting by Transpositions because increasing the number of cycles is not sufficient: these cycles also need to be balanced.

Given a sequence of transpositions $S_\rho = (\rho_1, \rho_2, \ldots, \rho_k)$, let $(\pi, \breve{\pi}, \breve{\iota}) \cdot S_\rho$ denote $(\pi, \breve{\pi}, \breve{\iota}) \cdot \rho_1 \cdot \rho_2 \cdot \ldots \cdot \rho_k$, such that $\rho_{i+1}$ is always a valid transposition for $(\pi, \breve{\pi}, \breve{\iota}) \cdot \rho_1 \cdot \ldots \cdot \rho_i$ with $1 \le i < k$. Let $\Delta c(\pi, \breve{\pi}, \breve{\iota}, S_\rho) = c((\pi, \breve{\pi}, \breve{\iota}) \cdot S_\rho) - c(\pi, \breve{\pi}, \breve{\iota})$ and $\Delta c_b(\pi, \breve{\pi}, \breve{\iota}, S_\rho) = c_b((\pi, \breve{\pi}, \breve{\iota}) \cdot S_\rho) - c_b(\pi, \breve{\pi}, \breve{\iota})$ denote the variation in the number of cycles and balanced cycles, respectively, when $S_\rho$ is applied to $(\pi, \breve{\pi}, \breve{\iota})$.

Proof. From Lemma 3 we know that we can increase the number of cycles by at most 2. In this scenario, one cycle C is split into three by a single transposition $\rho$. If C is balanced, the best we can expect is that $\rho$ creates three balanced cycles, so $\Delta c_b(\pi, \breve{\pi}, \breve{\iota}, \rho) = 2$. Otherwise, at least one of the resulting cycles must be unbalanced too, since the weights of the black edges of the three cycles sum up to a value that is different from the sum of the weights of the gray edges. Therefore, the best we can expect is that the other two cycles are balanced, so $\Delta c_b(\pi, \breve{\pi}, \breve{\iota}, \rho) = 2$.

Proof. By Lemma 2 we know that $c_b(\iota, \breve{\iota}, \breve{\iota}) = n + 1$. Therefore, our goal is to increase the number of balanced cycles from $c_b(\pi, \breve{\pi}, \breve{\iota})$ to $n+1$; since this number increases by at most 2 for each transposition (Lemma 4), the lemma follows.
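The counting argument of this proof can be written out as a worked inequality. The closed form below is a reading of the proof rather than a quoted statement, and it assumes only the two facts the proof invokes: the sorted instance has $n+1$ balanced cycles (Lemma 2), and one transposition adds at most two balanced cycles (Lemma 4).

```latex
% Assumption: every transposition raises c_b by at most 2 (Lemma 4),
% and the sorted instance satisfies c_b = n + 1 (Lemma 2).
\begin{align*}
c_b(\pi,\breve{\pi},\breve{\iota}) + 2\, d_t(\pi,\breve{\pi},\breve{\iota}) &\ge n + 1 \\
\Longrightarrow \quad d_t(\pi,\breve{\pi},\breve{\iota}) &\ge
  \left\lceil \frac{n + 1 - c_b(\pi,\breve{\pi},\breve{\iota})}{2} \right\rceil
\end{align*}
```

As an illustration, for the first instance in the caption of Fig. 4 we have $n = 7$, and, counting a cycle as balanced when its black and gray weights agree, only the cycle (3, 1) is balanced, so the bound gives at least $\lceil (8 - 1)/2 \rceil = 4$ transpositions.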

This section presents properties and lemmas to support the 3.5-approximation presented in Sect. 5.

Let $e_{c_x}$ and $e_{c_y}$ be two arbitrary black edges in the same cycle $C = (c_1, \ldots, c_\ell)$, with $\{x, y\} \subset [1..\ell]$. We define the function $f$ on such pairs of black edges as follows: given the path P of black and gray edges that goes from $e_{c_x}$ to $e_{c_y}$, $f(e_{c_x}, e_{c_y})$ is the sum of the weights of the gray edges in P minus the sum of the weights of the black edges in P, excluding the black edges $e_{c_x}$ and $e_{c_y}$. Observe that we traverse the first black edge in the cycle from right to left, which means that the path that goes from $e_{c_x}$ to $e_{c_y}$ is different from the path that goes from $e_{c_y}$ to $e_{c_x}$.

Note that $f(e_{c_x}, e_{c_y}) + f(e_{c_y}, e_{c_x}) - w(e_{c_x}) - w(e_{c_y})$ indeed computes the sum of the weights of all gray edges of C minus the sum of the weights of all black edges of C.
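This identity can be verified by splitting the cycle into its two paths. The derivation below is a sketch under the verbal definition of $f$ (gray weights on the path minus the black weights strictly between the two endpoints) and assumes $x \ne y$; the shorthand $W_g$, $W_b$ for the total gray and black weight of C and $P_{x \to y}$ for the path from $e_{c_x}$ to $e_{c_y}$ is introduced here for illustration only.

```latex
% Every gray edge of C lies on exactly one of the two paths, and every
% black edge other than e_{c_x}, e_{c_y} lies strictly inside exactly one.
\begin{align*}
f(e_{c_x}, e_{c_y}) + f(e_{c_y}, e_{c_x})
  &= \underbrace{\sum_{e' \in P_{x \to y}} w(e') + \sum_{e' \in P_{y \to x}} w(e')}_{W_g}
   \;-\; \underbrace{\sum_{e \in C \setminus \{e_{c_x}, e_{c_y}\}} w(e)}_{W_b - w(e_{c_x}) - w(e_{c_y})} \\
  &= W_g - W_b + w(e_{c_x}) + w(e_{c_y}) \\
\Longrightarrow\quad
f(e_{c_x}, e_{c_y}) + f(e_{c_y}, e_{c_x}) - w(e_{c_x}) - w(e_{c_y}) &= W_g - W_b
\end{align*}
```

In particular, the left-hand side is zero exactly when C has equal gray and black weight.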

We now present several lemmas, which will later be combined to prove the correctness of our 3.5-approximation algorithm. Let us first present the ideas behind them. Due to space constraints, the proofs of Lemmas 6-13 are omitted. However, Fig. 4 shows how (and in which case) each of these lemmas is applied, as explained in detail right after the lemmas.

The next two lemmas deal with non-trivial negative cycles. Lemmas 6 and 8 show that it is always possible to increase the number of balanced cycles by applying one transposition on negative cycles that are non-oriented and oriented, respectively.

Lemma 7. Let $C = (c_1, \ldots, c_k)$ be a non-trivial oriented cycle. If C is not positive, there is a triple $(c_x, c_y, c_z)$ with $c_x > c_z > c_y$ and $1 \le x < y < z \le k$ such that $0 \le f(e_{c_x}, e_{c_y}) \le w(e_{c_x}) + w(e_{c_y})$, or $0 \le f(e_{c_y}, e_{c_z}) \le w(e_{c_y}) + w(e_{c_z})$, or $0 \le f(e_{c_z}, e_{c_x}) \le w(e_{c_z}) + w(e_{c_x})$.

Lemma 8. Let C be an oriented long negative cycle. There is a transposition that increases the number of balanced cycles by 1 and the number of cycles by 2.

Now let us explain how to deal with trivial negative cycles. We use Lemma 9 as an intermediate step towards Lemma 10, which shows how many transpositions are needed to transform trivial negative cycles into trivial balanced cycles.

The next three lemmas deal with non-trivial balanced cycles. Lemmas 11, 12 and 13 show that it is possible to increase the number of balanced cycles by applying transpositions on oriented long balanced cycles, non-oriented long balanced cycles, and short balanced cycles, respectively.

Lemma 11. Let C be an oriented long balanced cycle. Then it is possible to increase the number of balanced cycles by two after at most three transpositions.

Lemma 13. Let G be a graph with no long cycles such that all cycles are balanced. If G has short cycles, it is possible to increase the number of balanced cycles by two after two transpositions.

In Fig. 4(a), the blue cycle A = (6, 4, 2) is non-oriented and negative, so we can apply Lemma 6 using the positive cycle B = (7, 5). The intergenic transposition $\rho^{(2,6,7)}_{(5,4,1)}$ generates, in Fig. 4(b), the trivial balanced cycle C = (2) and the non-trivial cycle D = (7, 5, 3, 6), which in this case is negative and oriented. We can then use Lemma 8 on D, and the transposition $\rho^{(3,6,7)}_{(2,1,0)}$ generates, in Fig. 4(c), the trivial balanced cycle E = (3), the trivial cycle F = (7), and the short cycle G = (6, 4). Note that F is negative and G is balanced. At this stage, there is no long cycle, and we can use Lemma 10 on the trivial negative cycle F. This lemma requires a positive cycle to interact with the trivial negative cycle, and H = (8) is the only positive cycle. Since H is also trivial, we need to borrow a black edge of a balanced cycle to apply the transposition, so let us use the short balanced cycle G. Two consecutive transpositions on these black edges (see Fig. 4(c) and (d)) generate two balanced cycles I = (7) and J = (8) in Fig. 4(e), without modifying G. The weighted cycle graph in Fig. 4(e) has only balanced cycles, and they are either short or trivial. In this case we can use Lemma 13, which applies two intergenic transpositions to two non-oriented short cycles, here G = (6, 4) and K = (5, 1). The two consecutive transpositions applied to G and K (see Fig. 4(e) and (f)) generate four balanced cycles, increasing the number of balanced cycles by 2 and completing the sorting process: the weighted cycle graph in Fig. 4(g) has only balanced cycles.

Fig. 4. (a)-(g) shows a sequence of intergenic transpositions that sorts the instance $(\pi, \breve{\pi}, \breve{\iota})$ with $\pi = (3\ 2\ 1\ 6\ 5\ 4\ 7)$, $\breve{\pi} = (2, 6, 5, 4, 1, 4, 3, 0)$, and $\breve{\iota} = (4, 1, 3, 5, 2, 3, 6, 1)$ using Lemmas 6-10 and 13. (h)-(n) shows a sequence of intergenic transpositions that sorts the instance $(\pi, \breve{\pi}, \breve{\iota})$ with $\pi = (5\ 4\ 3\ 2\ 1\ 6\ 8\ 7)$, $\breve{\pi} = (3, 7, 8, 3, 7, 4, 10, 1, 2)$, and $\breve{\iota} = (3, 1, 6, 5, 9, 8, 4, 2, 7)$ using Lemmas 11 and 12.

In Fig. 4(h), the green cycle A = (9, 7, 8) is oriented, and since it is also balanced we can use Lemma 11. The three transpositions in Fig. 4(h)-(j) break A into three balanced cycles. In Fig. 4(k) we have only balanced non-oriented cycles, and we can use Lemma 12, which applies a transposition followed by Lemma 11 twice. The first transposition is applied over cycle B = (5, 3, 1), transforming the blue cycle in Fig. 4(k) into an oriented balanced cycle C = (6, 2, 4). The first application of Lemma 11 breaks C into three balanced cycles (in this case, one transposition is sufficient) and also transforms the non-oriented cycle B into the oriented cycle B' = (5, 1, 3) (from Fig. 4(l) to (m)). The second application of Lemma 11 breaks B' into three balanced cycles (again using only one transposition), completing the sorting process: Fig. 4(n) has only balanced cycles.

Algorithm 1 focuses on applying transpositions that increase the number of balanced cycles, following a sequence of steps using Lemmas 6, 8, 10, 11, 12, and 13.

Data: an instance $(\pi, \breve{\pi}, \breve{\iota})$.
Result: a sequence $\rho_1, \rho_2, \ldots, \rho_m$ such that $(\pi, \breve{\pi}) \cdot \rho_1 \cdot \rho_2 \cdot \ldots \cdot \rho_m = (\iota, \breve{\iota})$.
1 sequence ← ∅
2 while $(\pi, \breve{\pi}, \breve{\iota}) \ne (\iota, \breve{\iota}, \breve{\iota})$ do

Let us briefly show the correctness of Algorithm 1, i.e., that it stops and reaches $(\iota, \breve{\iota}, \breve{\iota})$.

While $(\pi, \breve{\pi}, \breve{\iota}) \ne (\iota, \breve{\iota}, \breve{\iota})$, one of the following must be true: (i) there is an oriented cycle: in this scenario we break this cycle if it is balanced or negative on lines 5 and 7; (ii) there is a negative cycle that is not oriented: we create balanced cycles if it is non-oriented or trivial on lines 9 and 11; and (iii) all cycles are balanced, and there are no oriented cycles: if there is a long cycle we break it at line 13, and we break short cycles at line 15.

Note that we do not handle positive oriented cycles explicitly: they become either balanced or negative before the algorithm uses (iii), and will be handled in (i) at some point. If the algorithm reaches (iii), then all cycles are balanced, since any negative cycle is handled by (i) and (ii).
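The case analysis above can be sketched as a small dispatcher that only decides which lemma applies next; it does not perform the transpositions, whose details are given by the lemmas themselves. Everything in this sketch is illustrative: the names `Cycle` and `next_step` are hypothetical, a cycle is treated as negative when its gray weight is smaller than its black weight (the convention that matches the cycles labeled negative in the Fig. 4 walkthrough), and the order in which balanced and negative oriented cycles are tried within case (i) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Cycle:
    labels: Tuple[int, ...]   # black-edge labels (c_1, ..., c_l)
    diff: int                 # gray weight minus black weight (0 = balanced)

    @property
    def oriented(self) -> bool:
        decreasing = all(a > b for a, b in zip(self.labels, self.labels[1:]))
        return len(self.labels) >= 2 and not decreasing


def next_step(cycles: List[Cycle]) -> Optional[Tuple[str, Cycle]]:
    """Pick the lemma to apply, following cases (i)-(iii) of the discussion."""
    rules = [
        ("Lemma 11", lambda c: c.oriented and c.diff == 0),         # (i) balanced
        ("Lemma 8", lambda c: c.oriented and c.diff < 0),           # (i) negative
        ("Lemma 6", lambda c: len(c.labels) >= 2 and c.diff < 0),   # (ii) non-oriented
        ("Lemma 10", lambda c: len(c.labels) == 1 and c.diff < 0),  # (ii) trivial
        ("Lemma 12", lambda c: len(c.labels) >= 3),                 # (iii) long
        ("Lemma 13", lambda c: len(c.labels) == 2),                 # (iii) short
    ]
    for name, applies in rules:
        for cycle in cycles:
            if applies(cycle):
                return name, cycle
    return None   # only trivial balanced cycles remain: the instance is sorted


# Cycles of Fig. 4(a), with weights taken from the caption of Fig. 4.
fig4a = [Cycle((8,), 1), Cycle((7, 5), 4), Cycle((6, 4, 2), -5), Cycle((3, 1), 0)]
print(next_step(fig4a))
```

On the cycles of Fig. 4(a), the dispatcher selects Lemma 6 on the cycle (6, 4, 2), which agrees with the first step of the walkthrough. Reaching the last two rules implies that no negative cycle is left, and since the differences of all cycles sum to zero, no positive cycle is left either.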

Concerning the complexity of Algorithm 1, the loop of lines 2-17 iterates at most $n+1$ times, since each time the algorithm applies one of those lemmas it increases the number of balanced cycles by at least one. Finding which lemma to use (and at which positions the transposition takes place) requires $O(n^2)$ time. Thus, the overall complexity of Algorithm 1 is $O(n^3)$.

Now let us discuss the approximation factor that Algorithm 1 guarantees. Note that some of the steps of the algorithm require more than one transposition, so the approximation factor will be computed as follows.

Definition 14. Let $S_\rho = (\rho_1, \rho_2, \ldots, \rho_d)$ be a sequence of transpositions such that $(\pi, \breve{\pi}, \breve{\iota}) \cdot S_\rho = (\sigma, \breve{\sigma}, \breve{\iota})$. By Lemma 4, $S_\rho$ creates up to $2d$ balanced cycles. Therefore, the approximation factor is at most $\frac{2d}{c_b(\sigma, \breve{\sigma}, \breve{\iota}) - c_b(\pi, \breve{\pi}, \breve{\iota})}$.

The following lemma shows that each step of Algorithm 1 guarantees an approximation factor of 3.5 or less, which leads to the 3.5-approximation algorithm we propose.

Lemma 15. Algorithm 1 has an approximation factor of 3.5.

Proof. We use the formula from Definition 14 to calculate the approximation factor of each step.

- The step using Lemma 8 creates at least one new balanced cycle using one transposition, which leads to a maximum approximation factor of 2/1 = 2.
- The step using Lemma 11 creates at least two new balanced cycles using up to three transpositions, so its maximum approximation factor is 6/2 = 3.
- The step using Lemma 6 creates a new balanced cycle using one transposition, and its approximation factor is 2/1 = 2.
- The step using Lemma 10 creates two new balanced cycles using two transpositions, so its approximation factor is 4/2 = 2.
- The step using Lemma 12 creates four new balanced cycles using up to seven transpositions, and it follows that its maximum approximation factor is 14/4 = 3.5.
- The step using Lemma 13 creates two new balanced cycles using two transpositions, so its approximation factor is 4/2 = 2.

Since every step has an approximation factor of at most 3.5, the lemma follows.
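The per-step factors can be recomputed with exact arithmetic from the formula of Definition 14. The sketch below only restates the numbers given in the proof (transpositions used and balanced cycles created per step); the names `STEPS` and `step_factor` are hypothetical.

```python
from fractions import Fraction

# (transpositions used, balanced cycles created) for each step, as in the proof.
STEPS = {
    "Lemma 6": (1, 1),
    "Lemma 8": (1, 1),
    "Lemma 10": (2, 2),
    "Lemma 11": (3, 2),
    "Lemma 12": (7, 4),
    "Lemma 13": (2, 2),
}


def step_factor(transpositions: int, new_balanced: int) -> Fraction:
    """Definition 14: d transpositions could create up to 2d balanced cycles."""
    return Fraction(2 * transpositions, new_balanced)


factors = {name: step_factor(t, b) for name, (t, b) in STEPS.items()}
worst = max(factors.values())
for name, value in factors.items():
    print(name, float(value))
print("worst case:", float(worst))
```

The maximum is attained by the step using Lemma 12, with factor 14/4 = 3.5.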

We adapted the breakpoint graph to represent both gene order and intergenic sizes, and investigated properties of this new graph structure during a sorting process. As a result, we were able to design a 3.5-approximation algorithm for the Sorting by Intergenic Transpositions problem. We also showed that this problem is NP-hard.

As future work, one can explore a problem where the probability of an intergenic region being affected by transpositions is related to its size, i.e., when genome rearrangements are more likely to cut the genome in larger intergenic regions. One can also investigate the use of reversals and transpositions on signed permutations along with intergenic regions.

1. Sorting by transpositions
2. Genome rearrangement distances and gene order phylogeny in γ-proteobacteria
3. Breaking good: accounting for fragility of genomic regions in rearrangement distance estimation
4. Comparative genomics on artificial life
5. Sorting by genome rearrangements on both gene order and intergenic sizes
6. Sorting by transpositions is difficult
7. Genome rearrangements with indels in intergenes restrict the scenario space
8. A 1.375-approximation algorithm for sorting by transpositions
9. Algorithms for computing the double cut and join distance on both gene order and intergenic sizes
10. Transforming men into mice (polynomial algorithm for genomic distance problem)
11. Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement
12. TIBA: a tool for phylogeny inference from rearrangement data with bootstrap analysis
13. Super short operations on both gene order and intergenic sizes
14. Distance-based genome rearrangement phylogeny