Transforming Cabbage into Turnip: Polynomial Algorithm for...

27
Transforming Cabbage into Turnip: Polynomial Algorithm for Sorting Signed Permutations by Reversals SRIDHAR HANNENHALLI Bioinformatics, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania AND PAVEL A. PEVZNER University of Southern California, Los Angeles, California Abstract. Genomes frequently evolve by reversals r ( i , j ) that transform a gene order p 1 ... p i p i 11 ... p j 21 p j ... p n into p 1 ... p i p j 21 ... p i 11 p j ... p n . Reversal distance between permutations p and s is the minimum number of reversals to transform p into s. Analysis of genome rearrangements in molecular biology started in the late 1930’s, when Dobzhansky and Sturtevant published a milestone paper presenting a rearrangement scenario with 17 inversions between the species of Drosophila. Analysis of genomes evolving by inversions leads to a combinatorial problem of sorting by reversals studied in detail recently. We study sorting of signed permutations by reversals, a problem that adequately models rearrangements in small genomes like chloroplast or mitochondrial DNA. The previously suggested approximation algorithms for sorting signed permutations by reversals compute the reversal distance between permutations with an astonishing accuracy for both simulated and biological data. We prove a duality theorem explaining this intriguing performance and show that there exists a “hidden” parameter that allows one to compute the reversal distance between signed permutations in polynomial time. Categories and Subject Descriptors: F.1.3 [Computation by Abstract Devices]: Modes of Computa- tion; G.2.1 [Discrete Mathematics]: Combinatorics; J.3 [Life and Medical Sciences]: biology and genetics General Terms: Algorithms, Performance Additional Key Words and Phrases: Computational biology, genetics A preliminary version of this paper appeared in Proceedings of the 27th Annual ACM Symposium on the Theory of Computing (Las Vegas, Nev., May 29 –June 1). ACM, New York, 1995, pp. 178 –189. This work is supported by National Science Foundation (NSF) Young Investigator Award, NSF grant CCR 93-08567, NIH grant 1R01 HG00987, and DOE grant DE-FG02-94ER61919. Authors’ addresses: S. Hannenhalli, Bioinformatics, SmithKline Beecham Pharmaceuticals, King of Prussia, PA 19406-0939, e-mail: [email protected], and P. A. Pevzner, Departments of Mathematics and Computer Science, University of Southern California, DRB-155, Los Angeles, CA 90089-1113, e-mail: [email protected]. Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery (ACM), Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 1999 ACM 0004-5411/99/0100-0001 $5.00 Journal of the ACM, Vol. 46, No. 1, January 1999, pp. 1–27.

Transcript of Transforming Cabbage into Turnip: Polynomial Algorithm for...

Page 1: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

Transforming Cabbage into Turnip: PolynomialAlgorithm for Sorting Signed Permutations by Reversals

SRIDHAR HANNENHALLI

Bioinformatics, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania

AND

PAVEL A. PEVZNER

University of Southern California, Los Angeles, California

Abstract. Genomes frequently evolve by reversals r(i, j) that transform a gene order p1. . .

p ip i11. . . p j21p j

. . . pn into p1. . . p ip j21

. . . p i11p j. . . pn. Reversal distance between

permutations p and s is the minimum number of reversals to transform p into s. Analysis of genomerearrangements in molecular biology started in the late 1930’s, when Dobzhansky and Sturtevantpublished a milestone paper presenting a rearrangement scenario with 17 inversions between thespecies of Drosophila. Analysis of genomes evolving by inversions leads to a combinatorial problem ofsorting by reversals studied in detail recently. We study sorting of signed permutations by reversals, aproblem that adequately models rearrangements in small genomes like chloroplast or mitochondrialDNA. The previously suggested approximation algorithms for sorting signed permutations byreversals compute the reversal distance between permutations with an astonishing accuracy for bothsimulated and biological data. We prove a duality theorem explaining this intriguing performance andshow that there exists a “hidden” parameter that allows one to compute the reversal distance betweensigned permutations in polynomial time.

Categories and Subject Descriptors: F.1.3 [Computation by Abstract Devices]: Modes of Computa-tion; G.2.1 [Discrete Mathematics]: Combinatorics; J.3 [Life and Medical Sciences]: biology andgenetics

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Computational biology, genetics

A preliminary version of this paper appeared in Proceedings of the 27th Annual ACM Symposium onthe Theory of Computing (Las Vegas, Nev., May 29 –June 1). ACM, New York, 1995, pp. 178 –189.This work is supported by National Science Foundation (NSF) Young Investigator Award, NSF grantCCR 93-08567, NIH grant 1R01 HG00987, and DOE grant DE-FG02-94ER61919.Authors’ addresses: S. Hannenhalli, Bioinformatics, SmithKline Beecham Pharmaceuticals, King ofPrussia, PA 19406-0939, e-mail: [email protected], and P. A. Pevzner, Departments ofMathematics and Computer Science, University of Southern California, DRB-155, Los Angeles, CA90089-1113, e-mail: [email protected] to make digital / hard copy of part or all of this work for personal or classroom use isgranted without fee provided that the copies are not made or distributed for profit or commercialadvantage, the copyright notice, the title of the publication, and its date appear, and notice is giventhat copying is by permission of the Association for Computing Machinery (ACM), Inc. To copyotherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permissionand / or a fee.© 1999 ACM 0004-5411/99/0100-0001 $5.00

Journal of the ACM, Vol. 46, No. 1, January 1999, pp. 1–27.

Page 2: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

1. Introduction

1.1. MOTIVATION AND BIOLOGICAL BACKGROUND. In the late 1980’s, JeffreyPalmer and colleagues discovered a remarkable and novel pattern of evolution-ary change in plant organelles. They compared the mitochondrial genomes ofBrassica oleracea (cabbage) and Brassica campestris (turnip), which are veryclosely related (many genes are 99%–99.9% identical). To their surprise, thesemolecules, which are almost identical in gene sequence, differ dramatically ingene order (Figure 1). This discovery and many other studies in the last decadeconvincingly proved that genome rearrangements is a common mode of molecu-lar evolution in mitochondrial, chloroplast, viral and bacterial DNA (see Bafnaand Pevzner, [1995]).

Every study of genome rearrangements involves solving a combinatorial“puzzle” to find a shortest series of reversals to transform one genome intoanother. (Three such reversals “transforming” cabbage into turnip are shown inFigure 1.) In cases of genomes consisting of small number of “conserved blocks,”Palmer and co-authors were able to find the most parsimonious scenarios forrearrangements. However, for genomes consisting of more than 10 blocks,exhaustive search over all potential solutions is far beyond the possibilities of“pen-and-pencil” methods. As a result, Palmer and Herbon [1988] and Makaroffand Palmer [1988] overlooked the most parsimonious scenarios of rearrange-ments in more complicated cases like turnip vs. black mustard or turnip vs. radish(see Bafna and Pevzner [1995] for optimal solutions).

In the problem we consider, the genes are numbered 1, . . . , n and the orderof genes in two organisms is represented by permutations p 5 (p1p2

. . . pn)and s 5 (s1s2

. . . sn). A reversal r(i, j) is the permutation

S 1 2 · · · i 2 1 i i 1 1 · · · j 2 1 j j 1 1 · · · n1 2 · · · i 2 1 j j 2 1 · · · i 1 1 i j 1 1 · · · nD .

Clearly p z r(i, j) has the effect of reversing the order of genes p ip i11. . . p j. In

the case of signed permutations with 1 or 2 signs associated with every elementof p, p z r(i, j) reverses both the order and signs of the elements p ip i11

. . . p j

(see below).Given permutations p and s, the reversal distance problem is to find a series of

reversals r1, r2, . . . , r t such that p z r1 z r2. . . r t 5 s and t is minimum. We

call t the reversal distance between p and s. Note that the reversal distance

FIG. 1. “Transformation” of cabbage into turnip. Mitochondrial DNA of cabbage and turnip arecomposed of five conserved blocks of genes that are shuffled in cabbage as compared to turnip. Everyconserved block has a direction that is shown by 1 or 2 sign.

2 S. HANNENHALLI AND P. A. PEVZNER

Page 3: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

between p and s equals the reversal distance between s21p and the identitypermutation (1 2 . . . n). Sorting p by reversals is the problem of finding thereversal distance, d(p), between p and the identity permutation (Figure 2(a)).

1.2. PREVIOUS RESULTS. Analysis of genome rearrangements provides a mul-titude of challenges for computer scientists; see Pevzner and Waterman [1995]for a review of open combinatorial problems motivated by genome rearrange-ments. A computational approach based on comparison of gene orders versustraditional comparison of DNA sequences was pioneered by Sankoff (see Sankoffet al. [1990; 1992] and Sankoff [1992]). Kececioglu and Sankoff [1995] firstformulated the reversal distance problem and derived the lower and upperbounds for reversal distance. This approach led to the first approximationalgorithm for sorting by reversals, which generated the exact solutions in anumber of difficult instances. The problem was further studied by Bafna andPevzner [1996], who introduced the notion of breakpoint graph of a permutationand revealed important links between the maximum cycle decomposition of thisgraph and reversal distance.1

1.3. BREAKPOINT GRAPH AND CYCLE DECOMPOSITION. What makes it hard tosort a permutation? In the very first computational studies of genome rearrange-ments, Watterson et al. [1982], and Nadeau and Taylor [1984] introduced thenotion of breakpoint and noticed some correlations between the reversal distanceand the number of breakpoints. (In fact, Sturtevant and Dobzhansky [1936]implicitly discussed these correlations 60 years ago!) Below, we define the notionof breakpoint.

Let i ; j, if ui 2 j u 5 1. Extend a permutation p 5 p1p2. . . pn by adding

p0 5 0 and pn11 5 n 1 1. We call a pair of elements (p i, p i11), 0 # i # n,of p an adjacency if p i ; p i11, and a breakpoint if p i ò p i11. Since the identitypermutation has no breakpoints, sorting by reversals corresponds to eliminatingbreakpoints. An observation that every reversal can eliminate at most 2 break-points immediately implies that d(p) $ (b(p)/ 2), where bp) is the number ofbreakpoints in p. However, the estimate of reversal distance in terms ofbreakpoints is very inaccurate. Bafna and Pevzner [1996] showed that there existsanother parameter (size of a maximum cycle decomposition of the breakpointgraph) that estimates reversal distance with much greater accuracy.

The breakpoint graph of a permutation p is an edge-colored graph G(p) withn 1 2 vertices {p0, p1, . . . , pn, pn11} 5 {0, 1, . . . , n, n 1 1}. We joinvertices p i and p j by a black edge if (p i, p j) is a breakpoint in p (i.e., p i ò p j

and i ; j) and by a gray edge if (i, j) is a breakpoint in p21 (i.e., p i ; p j andi ò j). See Figure 2(b).

A cycle in an edge-colored graph G is called alternating if the colors of everytwo consecutive edges of this cycle are distinct. In the following, by cycles, wemean alternating cycles. The length of a cycle C, denoted by l(C), is the number

1 See also Kececioglu and Sankoff [1994], Kececioglu and Gusfield [1994], Kececioglu and Ravi[1995], Hannenhalli [1995], Hannenhalli and Pevzner [1995, 1996], Berman and Hannenhalli [1996],Caprara [1997], Tarjan et al. [1997], and Bafna and Pevzner [1998] for recent progress on thecomputational aspects of genome rearrangements, as well as Gates and Papadimitriou [1979], Evenand Goldreich [1981], Jerrum [1985], Aigner and West [1987], Cohen and Blum [1993], and Heydariand Sudborough [1993] for studies of related combinatorial problems.

3Transforming Cabbage into Turnip

Page 4: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

of black (or equivalently, gray) edges in it. A cycle C is short if l(C) 5 2 and longif l(C) . 2. A permutation p is simple if its breakpoint graph has no long cycles.

Consider a cycle decomposition of G(p) into a maximum number c(p) ofedge-disjoint alternating cycles. For the permutation p in Figure 2(b) c(p) 5 4since G(p) can be decomposed into three short cycles (8, 9, 7, 6, 8), (8, 5, 4, 7,8), and (3, 5, 6, 4, 3) and one long cycle (0, 1, 10, 9, 2, 3, 0). Bafna and Pevzner[1996] showed that every reversal changes the parameter b(p) 2 c(p) by atmost 1 and therefore the maximum cycle decomposition provides a better boundfor the reversal distance:

d~p! $ b~p! 2 c~p! . (1)

FIG. 2. (a) Optimal sorting of a permutation (3 5 8 6 4 7 9 2 1 10 11) by 5 reversals (b) Breakpointgraph of this permutation: black edges connect adjacent vertices that are not consecutive, gray edgesconnect consecutive vertices that are not adjacent. (c) Transformation of a signed permutation intoan unsigned permutation p and the breakpoint graph G(p); (d) Interleaving graph Hp with twooriented and one unoriented component.

4 S. HANNENHALLI AND P. A. PEVZNER

Page 5: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

However, finding a maximum cycle decomposition is a difficult problem (Kece-cioglu and Sankoff [1995] gave a linear programming bound for the size of themaximal cycle decomposition). Fortunately, in the biologically relevant case ofsigned permutations, this problem is trivial. Genes are directed fragments of DNAand a sequence of n genes in a genome is represented by a signed permutation on{1, . . . , n} with 1 or 2 sign associated with every element of p. For example, agene order for B. oleracea presented in Figure 1 is modeled by a signedpermutation (11 25 14 23 12). In the signed case, every reversal of a fragmentchanges both the order and the signs of the elements within that fragment(Figure 1). We are interested in the minimum number of reversals d(p) requiredto transform a signed permutation p into the identity signed permutation (1112 . . . 1n).

1.4. NEW RESULTS. Bafna and Pevzner [1996] noted that the concept ofbreakpoint graph extends naturally to signed permutations and devised anapproximation algorithm for sorting signed permutations by reversals withperformance ratio 1.5. For signed permutations, the bound (1) approximates thereversal distance extremely well for both simulated [Kececioglu and Sankoff1994] and biological data [Bafna and Pevzner, 1995; Hannenhalli et al., 1995].Kececioglu and Sankoff [1994] observed that an average difference between thebound (1) and the exact distance is less than 1 for random permutations. Thisintriguing performance raises a question whether the bound (1) overlooked athird parameter (in addition to the number of breakpoints and the size of amaximum cycle decompositions) that would allow closing the gap between d(p)and b(p) 2 c(p). Below, we answer this question by revealing the third“hidden” parameter (number of hurdles in p) making it harder to sort apermutation. We show that

b~p! 2 c~p! 1 h~p! # d~p! # b~p! 2 c~p! 1 h~p! 1 1, (2)

where h(p) is the number of hurdles in p. Based on this result, we devise apolynomial algorithm for sorting signed permutations by reversals. This is thefirst polynomial algorithm for a realistic model of genome rearrangements.

The paper is organized as follows: In Section 2, we extend the definition ofbreakpoint graph for signed permutations and introduce the notions of orientedand unoriented cycles. In Section 3, we introduce the notion of a hurdle andprove the bound

d~p! $ b~p! 2 c~p! 1 h~p! .

Previous studies revealed that a complicated interleaving structure of long cyclesin the breakpoint graph poses major difficulties in analyzing genome rearrange-ments. To get around this problem we develop a new technique called equivalenttransformations of permutations (Section 4). The technique allows one to mimicsorting permutations with long cycles by sorting simple permutations. In Section5, we prove important structural theorems for simple permutations and make thefirst step towards proving the bound

d~p! # b~p! 2 c~p! 1 h~p! 1 1.

5Transforming Cabbage into Turnip

Page 6: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

In Section 6, we associate a partial order with every permutation and show howthis partial order is affected by reversals. The properties of this partial orderallow us to introduce safe reversals that are the key operations for clearing thehurdles in our algorithm. In Section 7, we further develop a characterization of“hard-to-sort” permutations (called fortresses) that can not be sorted in b(p) 2c(p) 1 h(p) steps and prove the duality theorem.

d~p! 5 H b~p! 2 c~p! 1 h~p! 1 1, if p is a fortressb~p! 2 c~p! 1 h~p! , otherwise.

Finally, in Section 8, we present a polynomial algorithm for sorting by reversalsbased on equivalent transformations, duality theorem and clearing the hurdles.The applications of these results are given in Hannenhalli and Pevzner [1996]where the duality theorem was used to settle two conjectures by Kececioglu andSankoff (reversals do not cut long strips and reversals do not increase the number ofbreakpoints), while the algorithm for sorting signed permutations was used toanalyze evolution of extensively rearranged plant and animal organelles.

2. Breakpoint Graph of Signed Permutation

Define a transformation from a signed permutation p of order n to an unsignedpermutation p9 of order 2n as follows: To model the directions of elements in p,replace the positive elements 1x by 2x 2 1, 2x and negative elements 2x by 2x,2x 2 1 (Figure 2(c)). We call the unsigned permutation p9 the image of thesigned permutation p. Observe that in the breakpoint graph of the image of asigned permutation, every vertex has degree at most 2.

Therefore, the cycle decomposition is unique, thus making the case of signedpermutations easier to handle. We observe that the identity signed permutationof order n maps to the identity (unsigned) permutation of order 2n, and theeffect of a reversal on p can be mimicked by a reversal on p9 thus implyingd(p) $ d(p9). In the following, by sorting of the image p9 5 (p91p92 . . . p92n)of a signed permutation p 5 (p1p2

. . . pn), we mean a sorting of p9 byreversals r(2i 1 1, 2j), which “cut” only after even positions of p9. The effect ofa reversal r(2i 1 1, 2j) on p9 can be mimicked by a reversal r(i 1 1, j) on p,thus implying that d(p) 5 d(p9) if the cuts between p92i21 and p92i areforbidden (see Bafna and Pevzner [1996] for details). In the rest of the paper, allunsigned permutations we consider are images of some signed permutations. Forconvenience, we extend the term signed permutation for unsigned permutationsp 5 (p1p2

. . . p2n) such that p2i21 and p2i are consecutive numbers for 1 #i # n.

Given an arbitrary reversal r, denote Db [ Db(p, r) 5 b(pr) 2 b(p)(increase in breakpoints), and Dc [ Dc(p, r) 5 c(pr) 2 c(p) (increase in thesize of the cycle decomposition). Bafna and Pevzner [1996] proved that for everypermutation p and reversal r, D(b 2 c) [ Db(p, r) 2 Dc(p, r) $ 21 (i.e.,every reversal reduces the parameter b(p) 2 c(p) by at most 1). We call areversal proper if D(b 2 c) 5 21.

If (p i21, p i) and (p j, p j11) are breakpoints (black edges in G(p)), we saythat reversal r(i, j) acts on black edges (p i21, p i) and (p j, p j11). r(i, j) is areversal (acting) on a cycle C of G(p) if the breakpoints (p i21, p i) and (p j,

6 S. HANNENHALLI AND P. A. PEVZNER

Page 7: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

p j11) belong to C. A gray edge g is oriented if a reversal acting on two blackedges incident to g is proper and unoriented, otherwise. For example, a gray edge(8, 9) in Figure 2(c) is oriented (since a reversal acting on black edges (8, 14) and(9, 15) destroys two breakpoints and one cycle) while a gray edges (4, 5) isunoriented. To provide an intuition for the notion of an oriented edge, we statethe following lemma:

LEMMA 1. Let (pi, pj) be a gray edge incident to black edges (pk, pi) and (pj,pl). Then (pi, pj) is oriented iff i 2 k 5 j 2 l.

PROOF. Notice that k 5 i 6 1 and l 5 j 6 1. If i 2 k 5 j 2 l, then eitherk 5 i 2 1, l 5 j 2 1 or k 5 i 1 1, l 5 j 1 1 (Figure 3(a)). Clearly D(b 2 c) 5 21,hence, the reversal acting on (pi, pj) is proper. If i 2 k Þ j 2 l, then either k 5i 2 1, l 5 j 1 1 or k 5 i 1 1, l 5 j 2 1 (Figure 3(b)). In this case, D(b) 5 0 andD(c) 5 0; hence, the reversal acting on (pi, pj) is not proper. e

A cycle in G(p) is oriented if it has an oriented gray edge and unoriented,otherwise. Cycles C and F in Figure 2(c) are oriented while cycles A, B, D, andE are unoriented. Clearly, there is no proper reversal acting on an unorientedcycle. It is easy to see that a permutation has a proper reversal iff it has anoriented cycle.

3. Interleaving Graph and Hurdles

Gray edges (p i, p j) and (pk, p t) in G(p) are interleaving if the intervals [i, j]and [k, t] overlap but neither of them contains the other. For example, edges(4, 5) and (18, 19) in Figure 2(c) are interleaving while edges (4, 5) and (22, 23)or (4, 5) and (16, 17) are noninterleaving. Cycles C1 and C2 are interleaving ifthere exist interleaving gray edges g1 [ C1 and g2 [ C2.

Let #p be the set of cycles in the breakpoint graph of a permutation p. Define

FIG. 3. (a) A proper reversals on an oriented gray edge. (b) A nonproper reversal on an unorientedgray edge.

7Transforming Cabbage into Turnip

Page 8: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

an interleaving graph Hp(#p, (p) of p with the edge set

(p 5 $~C1, C2!;C1 and C2 are interleaving cycles in G~p!% .

Figure 2(d) shows the interleaving graph Hp consisting of three connectedcomponents. The vertex set of Hp is partitioned into oriented and unorientedvertices (cycles in #p). A connected component of Hp is oriented if it has at leastone oriented vertex and unoriented otherwise. For a connected component U,define leftmost and rightmost positions of U as

Umin 5 minC[U

minp i[C

i and Umax 5 maxC[U

maxp i[C

i.

Let Extent(U) be the interval [Umin, Umax]. For example, a component Ucontaining cycles B, C and D in Figure 2(c) has leftmost vertex p2 5 6 andrightmost vertex p13 5 17; therefore, Extent(U) 5 [2, 13].

We say that a component U separates components U9, U0 in p if there exists agray edge (p i, p j) in U such that Extent(U9) , [i, j], but Extent(U0) ù [i, j] 5À. For example, the component U in Figure 4(a) separates the components U9and U0.

Let a be a partial order on a set P. An element x [ P is called a minimalelement in a if there is no element y [ P with y a x. An element x [ P is thegreatest in a if y a x for every y [ P.

Consider the set of unoriented components 8p in Hp and define the contain-ment partial order on this set, that is, U a W iff Extent(U) , Extent(W) for U,W [ 8p. A hurdle is defined as follows: An unoriented component U [ 8p thatis a minimal element in a is a hurdle, called minimal hurdle. The unorientedcomponent U [ 8p that is the greatest element in a is a hurdle, called thegreatest hurdle, if U does not separate any two minimal hurdles. Obviously, therecan be at most one greatest hurdle. We wish to emphasize that the notions ofminimal and greatest hurdles become equivalent if we circularize the linearpermutation. Let h(p) be the total number of hurdles in p. Permutation p inFigure 2(c) has one unoriented component and h(p) 5 1. Permutation in Figure4(b) has two minimal and one greatest hurdles (h(p) 5 3). Permutation inFigure 4(a) has 2 minimal and no greatest hurdles (h(p) 5 2) since the greatestunoriented component U in Figure 4(a) separates U9 and U0.

The following theorem further improves the bound (1):

FIG. 4. (a) Unoriented component U separates U9 and U0 by virtue of the edge (0, 1); (b) Hurdle Udoes not separates U9 and U0.

8 S. HANNENHALLI AND P. A. PEVZNER

Page 9: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

THEOREM 1. For arbitrary (signed) permutation p, d(p) $ b(p) 2 c(p) 1 h(p).

PROOF. Given an arbitrary reversal r, denote Dh [ Dh(p, r) 5 h(pr) 2h(p). Clearly, every reversal r acts on black edges of at most two hurdles andtherefore r “destroys” (i.e., transforms an unoriented component into an ori-ented component) at most two minimal hurdles. Note that, if r destroys twominimal hurdles in 8p, then r can not destroy the greatest hurdle in 8p (see thedefinition of the greatest hurdle). Therefore, Dh $ 22 for every reversal r.

Bafna and Pevzner [1993] proved that D(b 2 c) [ {21, 0, 1} (Figure 5). IfD(b 2 c) 5 21, then r acts on an oriented cycle and hence it does not destroyany hurdles in p. Therefore, Dh $ 0 and D(b 2 c 1 h) [ Db 2 Dc 1 Dh $21. If D(b 2 c) 5 0, then r acts on a cycle and therefore it affects at most onehurdle. It implies Dh $ 21 and D(b 2 c 1 h) $ 21. If D(b 2 c) 5 1, thenD(b 2 c 1 h) $ 21 since Dh $ 22 for every reversal r.

Therefore, for an arbitrary reversal r, D(b 2 c 1 h) $ 21 thus implyingd(p) $ b(p) 2 c(p) 1 h(p). e

In the following, we show that the lower bound d(p) $ b(p) 2 c(p) 1 h(p)is very tight. As the first step towards the upper bound d(p) # b(p) 2 c(p) 1h(p) 1 1, we develop the technique called equivalent transformations ofpermutations.

4. Equivalent Transformations of Permutations

Previous studies revealed that complicated interleaving structure of long cycles inthe breakpoint graphs poses serious difficulties in analyzing sorting by reversals

FIG. 5. (A) For reversals acting on two cycles, D(b 2 c) 5 1. (B) For reversals acting on anunoriented cycle, D(b 2 c) 5 0. (C) For reversals acting on an oriented cycle, D(b 2 c) 5 21.

9Transforming Cabbage into Turnip

Page 10: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

[Bafna and Pevzner 1996] and by transpositions [Bafna and Pevzner 1995]. Toget around this problem, we introduce equivalent transformations of permuta-tions based on the following idea. If a permutation p [ p(0) has a long cycle,transform it into a new permutation p(1) by “breaking” this long cycle into twosmaller cycles. Continue with p(1) in the same manner and form a sequence ofpermutations p [ p(0), p(1), . . . , p(k) [ s ending with a simple permuta-tion (i.e., one having no long cycles). In this section, we show that thesetransformations can be arranged in such a way that every sorting of s mimics asorting of p with the same number of reversals. In the following sections, weshow how to optimally sort simple permutations. Optimal sorting of the simplepermutation s mimics optimal sorting of the arbitrary permutation p leading to apolynomial algorithm for sorting by reversals.

Let b 5 (vb, wb) be a black edge and g 5 (wg, vg) be a gray edge belongingto a cycle C 5 . . . , vb, wb, . . . , wg, vg, . . . in the breakpoint graph G(p) of apermutation p. A ( g, b)-split of G(p) is a new graph G(p) obtained from G(p)by

—removing edges g and b,—adding two new vertices v and w,—adding two new black edges (vb, v) and (w, wb),—adding two new gray edges (wg, w) and (v, vg).

Figure 6 shows a ( g, b)-split transforming a cycle C in G(p) into cycles C1 andC2 in G(p). If G(p) is a breakpoint graph of a signed permutation p, then every( g, b)-split of G(p) corresponds to the breakpoint graph of a signed generalizedpermutation p such that G(p) 5 G(p). Below, we define generalized permuta-tions and describe the padding procedure to find a generalized permutation pcorresponding to a ( g, b)-split of G.

A generalized permutation p 5 (p1p2. . . pn) is a permutation of arbitrary

distinct reals (versus permutations of integers {1, 2, . . . , n} we consideredbefore). In this section, by permutations, we mean generalized permutations, andby identity generalized permutation we mean a generalized permutation p 5(p1p2

. . . pn) with p i , p i11 for 1 # i # n 2 1. Extend a permutation p 5(p1p2

. . . pn) by adding p0 5 min1#i#n p i 2 1 and pn11 5 max1#i#n p i 1 1.Elements p j and pk of p are consecutive if there is no element p l such that p j ,p l , pk for 1 # l # n. Elements p i and p i11 of p are adjacent for 0 # i # n.The breakpoint graph of a (generalized) permutation p 5 (p1p2

. . .pn) isdefined as the graph on vertices {p0, p1, . . . , pn, pn11} with black edgesbetween adjacent elements that are not consecutive and gray edges betweenconsecutive elements that are not adjacent. Obviously the definition of thebreakpoint graph for generalized permutation is consistent with the notion of thebreakpoint graph described earlier.

FIG. 6. Example of a ( g, b)-split.

10 S. HANNENHALLI AND P. A. PEVZNER

Page 11: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

Let b 5 (p i11, p i) be a black edge and g 5 (p j, pk) be a gray edge belongingto a cycle C 5 . . . , p i11, p i, . . . , p j, pk, . . . in the breakpoint graph G(p).Define D 5 pk 2 p j and let v 5 p j 1 (D/3), w 5 pk 2 (D/3). A( g, b)-padding of p 5 (p1p2

. . .pn) is a permutation on n 1 2 elementsobtained from p by inserting v and w after the ith element of p (0 # i # n):

p 5 ~p1p2 · · · p ivwp i11 · · · pn! .

Note that v and w are both consecutive and adjacent in p thus implying that if pis (the image of) a signed permutation then p is also (the image of) a signedpermutation. The ( g, b)-split in Figure 6 corresponds to ( g, b)-padding for g 5(wg, vg) and b 5 (vb, wb). The following lemma establishes the correspondencebetween ( g, b)-paddings and ( g, b)-splits.2

LEMMA 2. G(p) 5 G(p).

If g and b are nonincident edges of a long cycle C in G(p), then the( g, b)-padding breaks C into two smaller cycles in G(p). Therefore, paddingsmay be used to transform an arbitrary permutation p into a simple permutation.Note that b(p) 5 b(p) 1 1 and c(p) 5 c(p) 1 1. Below, we prove that, forevery permutation with a long cycle, there exists a padding on nonincident edgesof this cycle such that h(p) 5 h(p), thus indicating that padding provides a wayto eliminate long cycles in a permutation without changing the parameterb(p) 2 c(p) 1 h(p). First, we need a series of technical lemmas.

LEMMA 3. Let a (g, b)-padding on a cycle C in G(p) delete the gray edge g andadd two new gray edges g1 and g2. If g is oriented, then either g1 or g2 is oriented inG(p). If C is unoriented, then both g1 and g2 are unoriented in G(p).

The case when g is oriented is illustrated in Figure 7.

LEMMA 4. Let a (g, b)-padding break a cycle C in G(p) into cycles C1 and C2in G(p). Then C is oriented iff either C1 or C2 is oriented.

PROOF. Note that a ( g, b)-padding preserves orientation of gray edges inG(p), which are “inherited” from G(p) (Lemma 1). If C is oriented, then it hasan oriented gray edge. If this edge is different from g, then it remains oriented ina ( g, b)-padding of p and therefore a cycle (C1 or C2) containing this edge isoriented. If g 5 (wg, vg) is the only oriented gray edge in C, then ( g, b)-padding

2 Of course, a ( g, b)-padding of a permutation p 5 (p1p2. . . pn) on {1, 2, . . . , n} can be

modeled as a permutation p 5 (p1p2. . . p ivwp i11

. . . pn) on {1, 2, . . . , n 1 2} where v 5 p j 11, w 5 pk 1 1, p i 5 p i 1 2 if p i . min{p j, pk} and p i 5 p i otherwise. The generalizedpermutations were introduced to make the following “mimicking” procedure more intuitive.

FIG. 7. A ( g, b)-padding deletes an oriented edge g and adds an oriented edge g1 and unorientededge g2.

11Transforming Cabbage into Turnip

Page 12: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

adds two new gray edges ((wg, w) and (v, vg)) to G(p), one of which is oriented(Lemma 3). Therefore, a cycle (C1 or C2) containing this edge is oriented.

If C is an unoriented cycle, then all edges of C1 and C2 “inherited” from Cremain unoriented. Lemma 3 implies that new edges ((wg, w) and (v, vg)) in C1and C2 are also unoriented. e

The following lemma shows that paddings preserve interleaving of gray edges:

LEMMA 5. Let g9 and g0 be two gray edges of G(p) different from g. Then g9and g0 are interleaving in p iff g9 and g0 are interleaving in a (g, b)-padding of p.

This lemma immediately implies the following:

LEMMA 6. Let a (g, b)-padding breaks a cycle C in G(p) into cycles C1 and C2in G(p). Then every cycle D interleaving with C in G(p) interleaves with either C1 orC2 in G(p).

PROOF. Let d [ D and c [ C be interleaving gray edges in G(p). If c isdifferent from g, then Lemma 5 implies that d and c are interleaving in G(p)and therefore D interleaves with either C1 or C2. If c 5 g, then it is easy to seethat one of the new gray edges in G(p) interleaves with d and therefore Dinterleaves with either C1 or C2 in G(p). e

LEMMA 7. For every gray edge g , there exists a gray edge f interleaving with g inG(p).

PROOF. Let g 5 (p i, p j). Gray edges and adjacencies in the breakpointgraph form a path from vertex 0 to n 1 1 that visits all vertices in G(p). Inparticular, this path visits the interval i 1 1, . . . , j 2 1 between the endpointsof g and thus contains a gray edge f connecting a vertex from this interval withthe “outside” of this interval (i.e., the set {0, 1, . . . , i 2 1, j 1 1, . . . , n 11}). e

LEMMA 8. Let C be a cycle in G(p) and g [y C be a gray edge in G(p). Then ginterleaves with an even number of gray edges in C.

PROOF. Let g 5 (p i, p j). The gray edges of cycle C enter and leave theinterval i 1 1, . . . , j 2 1 the same number of times (to leave a room, one firstneeds to enter it). e

A ( g, b)-padding f transforming p into p (i.e., p 5 p z f) is safe if it acts onnonincident edges of a long cycle and h(p) 5 h(p). Clearly, every safe paddingbreaks a long cycle into two smaller cycles.

THEOREM 2. If C is a long cycle in G(p), then there exists a safe (g, b)-paddingacting on C.

PROOF. If C has a pair of interleaving gray edges g1, g2 [ C, then removingthese edges transforms C into two paths. Since C is a long cycle at least one ofthese paths contains a gray-edge g. Pick a black-edge b from the other path andconsider the ( g, b)-padding transforming p into p (clearly, g and b are noninci-dent edges). This ( g, b)-padding breaks C into cycles C1 and C2 in G(p) with g1and g2 belonging to different cycles C1 and C2. By Lemma 5, g1 and g2 areinterleaving, thus implying that C1 and C2 are interleaving. Also this ( g, b)-padding does not “break” the component K in Hp containing the cycle C since by

12 S. HANNENHALLI AND P. A. PEVZNER

Page 13: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

Lemma 6 all cycles from K belong to the component of Hp containing C1 andC2. Moreover, according to Lemma 4 the orientation of this component in Hp

and Hp is the same. Therefore, the chosen ( g, b)-padding preserves the set ofhurdles and h(p) 5 h(p).

If all gray edges of C are mutually noninterleaving, then C is an unorientedcycle (refer to Figure 3). Lemmas 7 and 8 imply that there exists a gray edge e [C9 interleaving with at least two gray edges g1, g2 [ C. Removing g1 and g2transforms C into two paths and since C is a long cycle at least one of these pathscontains a gray edge g. Pick a black-edge b from the other path and consider the( g, b)-padding of p. This padding breaks C into cycles C1 and C2 in G(p) withg1 and g2 belonging to different cycles C1 and C2. By Lemma 5, both C1 and C2interleave with C9 in p. Therefore, this ( g, b)-padding does not break thecomponent K in Hp containing C and C9. Moreover, according to Lemma 4,both C1 and C2 are unoriented thus implying that the orientation of thiscomponent in Hp and Hp is the same. Therefore, the chosen ( g, b)-paddingpreserves the set of hurdles and hence, h(p) 5 h(p). e

A permutation p is equivalent to a permutation s (p V s) if there exists aseries of permutations p [ p(0), p(1), . . . , p(k) [ s such that p(i 1 1) 5p(i) z f(i) for a safe ( g, b)-padding f(i) acting on p i (0 # i # k 2 1).

THEOREM 3. For every permutation, there exists an equivalent simple permuta-tion.

PROOF. Define the complexity of a permutation p as ¥C[#p(l(C) 2 2)

where #p is the set of cycles in G(p) and l(C) is the length of a cycle C. Thecomplexity of a simple permutation is 0. Note that every padding on nonincidentedges of a long cycle C breaks C into cycles C1 and C2 with l(C) 5 l(C1) 1l(C2) 2 1. Therefore,

~l~C! 2 2! 5 ~l~C1! 2 2! 1 ~l~C2! 2 2! 1 1,

implying that a padding on nonincident edges of a cycle reduces the complexityof permutations. This observation and Theorem 2 imply that every permutationwith long cycles can be transformed into a permutation without long cycles by aseries of paddings preserving b(p) 2 c(p) 1 h(p). e

Let p be a ( g, b)-padding of p and r be a reversal acting on two black edges ofp. Then r can be mimicked on p by ignoring the padded elements. We need ageneralization of this observation.

A sequence of permutations p [ p(0), p(1), . . . , p(k) [ s is called ageneralized sorting of p if s is the identity (generalized) permutation and p(i 11) is obtained from p(i) either by a reversal or by a padding. Note that reversalsand paddings in generalized sorting of p may interleave. Interleaving of reversalsand paddings in generalized sorting is necessary to mimic sorting of the(genuine) permutation since a reversal may merge two short cycles into a longone.

LEMMA 9. Every generalized sorting of p mimics a ( genuine) sorting of p withthe same number of reversals.

13Transforming Cabbage into Turnip

Page 14: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

PROOF. Ignore padded elements. e

In the following, we show how to find a generalized sorting of a permutation pby a series of paddings and reversals containing d(p) reversals. Lemma 9 impliesthat this generalized sorting of p mimics an optimal (genuine) sorting of p.

5. Safe Reversals in Oriented Components

Recall that for an arbitrary reversal, D(b 2 c 1 h) $ 21 (see proof ofTheorem 1). A reversal r is safe if D(b 2 c 1 h) 5 21. In the following, weprove the existence of a safe reversal acting on a cycle in an oriented componentby analyzing actions of reversals on simple permutations. In this section, bycycles, we mean short cycles and by permutations we mean simple permutations.

Denote the set of all cycles interleaving with a cycle C in G(p) as V(C) (i.e.,V(C) is the set of vertices adjacent to C in Hp). Define the sets of edges in thesubgraph of Hp induced by V(C)

E~C! 5 $~C1, C2!;C1, C2 [ V~C! and C1 interleaves with C2 in p%

and its complement

E# ~C! 5 $~C1, C2!;C1, C2 [ V~C! and C1 does not interleave with C2 in p% .

A reversal r acting on an oriented (short) cycle C “destroys” C (i.e., removes theedges of C from G(p)) and transforms every other cycle in G(p) into acorresponding cycle on the same vertices in G(pr). As a result r transforms theinterleaving graph Hp(#p, (p) of p into the interleaving graph Hpr(#p\C, (pr)of pr. This transformation results in complementing the subgraph induced byV(C) as described by the following lemma (Figure 8). We denote (# p 5(p\{(C, D);D [ V(C)}.

LEMMA 10. Let r be a reversal acting on an oriented (short) cycle C. Then

—(pr 5 ((# p\E(C)) ø E# (C), that is r removes edges E(C) and adds edges E# (C)to transform Hp into Hpr.

—r changes the orientation of a cycle D [ #p iff D [ V(C).

Lemma 10 immediately implies the following:

LEMMA 11. Let r be a reversal acting on a cycle C and A, B be nonadjacentvertices in Hpr. Then ( A, B) is an edge in Hp iff A, B [ V(C).

Let K be an oriented component of Hp and let 5(K) be a set of reversalsacting on oriented cycles from K. Assume that a reversal r [ 5(K) “breaks” Kinto a number of connected components K1(r), K2(r), . . . in Hpr and the firstm of these components are unoriented. If m . 0, then r may be unsafe sincesome of the components K1(r), . . . , Km(r) may form new hurdles in pr thusincreasing h(pr) as compared to h(p). In the following, we show that there is aflexibility in choosing a reversal from the set 5(K) allowing one to substitute asafe reversal s for an unsafe reversal r.

14 S. HANNENHALLI AND P. A. PEVZNER

Page 15: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

LEMMA 12. Let r and s be the reversals acting on two interleaving orientedcycles C and C9 in G(p), respectively. If C9 belongs to an unoriented componentK1(r) in Hpr then

—every two vertices outside K1(r) which are adjacent in Hpr are also adjacent inHps.

—orientation of vertices outside K1(r) does not change in Hps as compared to Hpr.

PROOF. Let D, E be two vertices outside K1(r) connected by an edge in Hpr.If one of these vertices, say D, does not belong to V(C) in Hp, then Lemma 11implies (i) (C9, D) is not an edge in Hp and (ii) (D, E) is an edge in Hp.Therefore, by Lemma 10, reversal s preserves the edge (D, E) in Hps. If bothvertices D and E belong to V(C), then Lemma 10 implies that (D, E) is not anedge in Hp. Since vertex C9 and vertices D, E are in different components ofHpr, Lemma 11 implies that (C9, D) and (C9, E) are edges in Hp. Therefore, byLemma 10, (D, E) is an edge in Hps. In both cases, s preserves the edge (D, E)in Hps and the first part of the lemma holds.

Lemma 11 implies that for every vertex D outside K1(r), D [ V(C) iff D [V(C9). This observation and Lemma 10 imply that the orientation of verticesoutside K1(r) does not change in Hps as compared to Hpr. e

LEMMA 13. Every unoriented component in the interleaving graph (of a simplepermutation) contains at least 2 vertices.

PROOF. By Lemma 7, every gray edge in G(p) has an interleaving gray edge.Therefore every unoriented (short) cycle in G(p) has an interleaving cycle. e

FIG. 8. Reversal on a cycle C (i) deletes vertex C from the interleaving graph; (ii) changes theorientation of vertices in V(C); (iii) complements the subgraph induced by V(C).

15Transforming Cabbage into Turnip

Page 16: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

THEOREM 4. For every oriented component K in Hp there exists a (safe) reversalr [ 5(K) such that all components K1(r), K2(r), . . . are oriented in Hpr.

PROOF. Assume that a reversal r [ 5(K) “breaks” K into a number ofconnected components K1(r), K2(r), . . . in Hpr and the first m of thesecomponents are unoriented. Denote the overall number of vertices in theseunoriented components as index (r) 5 ¥ i51

m uKi(r) u where uKi(r) u is the numberof vertices in Ki(r). Let r be a reversal such that

index~r! 5 mins[5~K!

index~s! .

This reversal acts on a cycle C and breaks K into a number of components. If allthese components are oriented (i.e., index(r) 5 0), the theorem holds. Otherwise,index(r) . 0 and let K1(r), . . . , Km(r) (m $ 1) be unoriented components inHpr. Below, we find another reversal s [ 5(K) with index(s) , index(r), acontradiction.

Let V1 be the set of vertices of the component K1(r) in Hpr. Note that K1(r)contains at least one vertex from V(C) and consider the (nonempty) set V 5V1 ù V(C) of vertices from component K1(r) adjacent to C in Hp. Since K1(r)is an unoriented component in pr all cycles from V are oriented in p and allcycles from V1\V are unoriented in p (Lemma 10). Let C9 be an (oriented) cyclein V and let s be the reversal acting on C9 in G(p). Lemma 12 implies that fori $ 2 all edges of the component Ki(r) in Hpr are preserved in Hps and theorientation of vertices in Ki(r) does not change in Hps as compared to Hpr.Therefore all unoriented components Km11(r), Km12(r), . . . of pr “survive” inps and

index~s! # index~r! .

Below, we prove that there exists a reversal s acting on a cycle from V such thatindex(s) , index(r), a contradiction.

If V1 Þ V, then there exists an edge between an (oriented) cycle C9 [ V andan (unoriented) cycle C0 [ V1\V in (p. Lemma 10 implies that a reversal sacting on C9 in p orients the cycle C0 in G(sp). This observation and Lemma 12imply that s reduces index(s) by at least 1 as compared to index(r), a contradic-tion (refer to Figure 9(a)).

If V1 5 V (all cycles of K1 interleave with C), then there exist at least twovertices in V(C) (Lemma 13). Moreover, there exist (oriented) cycles C9, C0 [V1 such that (C9, C0) are not interleaving in p (otherwise, Lemma 10 wouldimply that K1(r) is a graph with no edges, a contradiction to connectivity ofK1(r)). Define s as a reversal acting on C9. Lemma 10 implies that s preservesthe orientation of C0 thus reducing index(s) by at least 1 as compared toindex(r), a contradiction (refer to Figure 9(b)).

The above discussion implies that there exists a reversal r [ 5(K) such thatindex(r) 5 0, that is, r does not create new unoriented components. Therefore,Db(p, r) 5 22, Dc(p, r) 5 21 and Dh(p, r) 5 0 implying that r is safe. e

6. Clearing the Hurdles

If p has an oriented component, then Theorem 4 implies that there exists a safereversal in p. In this section, we search for a safe reversal in the absence of any

16 S. HANNENHALLI AND P. A. PEVZNER

Page 17: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

oriented component. Let a be a partial order on a set P. We say x is covered byy in P if x a y and there is no element z [ P for which x a z a y. The covergraph V of a is an (undirected) graph with vertex set P and edge set {( x, y);x,y [ P and x is covered by y}.

Let 8p be the set of unoriented components in Hp and let Extent(U) 5 [Umin,Umax] be the interval between the leftmost and rightmost positions in anunoriented component U [ 8p (see Section 3). Define U# min 5 minU[8p

Umin,U# max 5 maxU[8p

Umax and let [U# min, U# max] be the interval between theleftmost and rightmost positions among all the unoriented components of p. LetU# be an (artificial) component associated with the interval [U# min, U# max].

Define 8# p as the set of u8pu 1 1 elements consisting of u8pu elements {U;U [8p} combined with an additional element U# . Let a[ap be the containmentpartial order on 8# p defined by the rule U a W iff Extent(U) , Extent(W) for U,W [ 8# p. If there exists the greatest unoriented component U in p (i.e.,Extent(U) 5 [U# min, U# max]), we assume that there exist two elements (“real”

FIG. 9. Proof of Theorem 4. A reversal on a cycle C9 has a smaller index than a reversal on a cycleC.

17Transforming Cabbage into Turnip

Page 18: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

component U and “artificial” component U# ), corresponding to the greatestinterval and that U ap U# . Let Vp be the tree representing the cover graph of thepartial order ap on hk# p (Figure 10(a)). Every vertex in Vp but U# is associatedwith an unoriented component in 8p. In the case, p has the greatest hurdle weassume that the leaf U# is associated with this greatest hurdle (i.e., in this case,there are two vertices corresponding to the greatest hurdle, leaf U# and itsneighbor, the greatest hurdle U [ 8p). Every leaf in Vp, corresponding to aminimal element in ap, is a hurdle. In the case U# is a leaf in Vp, it is notnecessarily a hurdle (e.g., U# is a leaf in Vp but not a hurdle for a permutation pshown in Figure 4(a)). Therefore, the number of leaves in Vp coincides with thenumber of hurdles h(p), except for the cases when3

—there exists only one unoriented component in p (in this case, Vp consists oftwo copies of this component and has two leaves while h(p) 5 1)

—there exists the greatest element in 8p, which is not a hurdle, that is, thiselement separates other hurdles (in this case the number of leaves equalsh(p) 1 1).

3 Although an addition of an “artificial” component U# might seem unnecessary, we will find belowthat such an addition greatly facilitates the analysis of technical details.

FIG. 10. (a) A cover graph Vp of a permutation p with “real” unoriented components K, L, M, N,P, U and an “artificial” component U# ; (b) A reversal r merging hurdles L and M in p transformsunoriented components L, K, P and M into an oriented component which “disappears from Vpr.This reversal transforms unoriented cycles (32, 33, 36, 37, 32) and (10, 11, 14, 15, 10) in p into an orientedcycle (15, 14, 11, 10, 32, 33, 36, 37, 15) in pr. LCA(L, M) 5 LCA(L, M) 5 U and PATH( A, F) 5 {L, K,U, P, M}.

18 S. HANNENHALLI AND P. A. PEVZNER

Page 19: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

LEMMA 14. (HURDLE CUTTING). Every reversal r on a cycle in a hurdle K cutsoff the leaf K from the cover graph of p, that is, Vpr 5 Vp\K.

PROOF. If r acts on an unoriented cycle of a component K in p, then Kremains “unbroken” in pr. Also Lemma 7 implies that every reversal on an(unoriented) cycle of an (unoriented) component K orients at least one cycle inK. Therefore, r transforms K into an oriented component in pr and deletes theleaf K from the cover graph. e

A hurdle K [ 8p protects a nonhurdle U [ 8p if deleting K from 8p

transforms U from a nonhurdle into a hurdle (i.e., U is a hurdle in 8p\K). Ahurdle in p is a superhurdle if it protects a nonhurdle U [ 8p and a simplehurdle, otherwise. Components M, N, and U in Figure 10(a) are simple hurdleswhile component L is a superhurdle (deleting L transforms a nonhurdle K into ahurdle). In Figure 11(a), all three hurdles are superhurdles while in Figure 11(b)there are two superhurdles and one simple hurdle (note that the cover graphs inFigure 11(a) and Figure 11(b) are the same!). The following lemma immediatelyfollows from the definition of a simple hurdle.

LEMMA 15. A reversal acting on a cycle of a simple hurdle is safe.

PROOF. Lemma 14 implies that for every reversal r acting on a cycle of asimple hurdle, b(p) 5 b(pr), c(p) 5 c(pr) and h(pr) 5 h(p) 2 1 implyingthat r is safe. e

FIG. 11. Permutation in (a) is a 3-fortress while permutation in (b) with the same cover graph is nota fortress (hurdle U is not a superhurdle since deleting U leaves a greatest component that separateshurdles L and M).

19Transforming Cabbage into Turnip

Page 20: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

Unfortunately, a reversal acting on a cycle of a superhurdle is unsafe since ittransforms a nonhurdle into a hurdle implying D(b 2 c 1 h) 5 0. Below, wedefine a new operation (hurdles merging) allowing one to search for safereversals even in the absence of simple hurdles.

If L and M are two hurdles in p, define PATH(L, M) as the set of(unoriented) components on the (unique) path from the leaf L to the leaf M inthe cover graph Vp. If both L and M are minimal elements in a, defineLCA(L, M) as an (unoriented) component that is the least common ancestor ofL and M and define LCA(L, M) as the least common ancestor of L and M thatdoes not separate L and M. Obviously, LCA(L, M) is either LCA(L, M) or itsparent. If L corresponds to the greatest hurdle U, there are two elements U andU# in 8# p corresponding to the same (greatest) interval [Umin, Umax] 5 [U# min,U# max]. In this case, define LCA(L, M) 5 LCA(L, M) 5 U. Let G(V, E) be agraph, w [ V and W , V. A contraction of W into w in G is defined as a newgraph with vertex set V\(W\w) and edge set {( p( x), p( y));( x, y) [ E}, wherep(v) 5 w if v [ W and p(v) 5 v, otherwise. Note that, if w [ W, then acontraction reduces the number of vertices in G by uW u 2 1, while, if w [y W,the number of vertices is reduced by uW u.

Let L and M be two hurdles in p and Vp be the cover graph of p. We defineVp(L, M) as the graph obtained from Vp by the contraction of PATH(L, M)into LCA(L, M) (loops in Vp(L, M) are ignored). Note that in the caseLCA(L, M) 5 LCA(L, M), Vp(L, M) corresponds to deleting the elements ofthe set PATH(L, M)\LCA(L, M) from the partial order ap while in the caseLCA(L, M) Þ LCA(L, M), Vp(L, M) corresponds to deleting the entire setPATH(L, M) from ap.

LEMMA 16. (HURDLES MERGING). Let p be a permutation with cover graph Vp

and let r be a reversal acting on black edges of (different) hurdles L and M in p.Then, r acts on Vp as the contraction of PATH(L, M) into LCA(L, M), that is,Vpr 5 Vp(L, M).

PROOF. The reversal r acts on black edges of the cycles C1 [ L and C2 [ Min G(p) and transforms C1 and C2 into an oriented cycle C in G(pr) (Figure10). It is easy to verify that every cycle interleaving with C1 or C2 in G(p)interleaves with C in G(pr). It implies that r transforms hurdles L and M in pinto parts of an oriented component in pr and, therefore, L and M “disappear”from Vpr.

Moreover, every (unoriented) component from PATH(L, M)\LCA(L, M) hasat least one cycle interleaving with C in G(pr). It implies that every suchcomponent in p becomes a part of an oriented component in pr and therefore“disappears” from Vpr. Every component from 8p\PATH(L, M) remains unori-ented in pr. Component LCA(L, M) remains unoriented iff LCA(L, M) 5LCA(L, M). Every component that is covered by a vertex from PATH(L, M) inap will be covered by LCA(L, M) in apr. e

We write U , W for hurdles U and W if the rightmost position of U is smallerthan the rightmost position of W, that is, Umax , Wmax. Order the hurdles of pin the increasing order of their rightmost positions

U~1! , · · · , U~l ! ; L , · · · , U~m! ; M , · · · , U~h~p!!

20 S. HANNENHALLI AND P. A. PEVZNER

Page 21: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

and define the sets of hurdles

BETWEEN~L, M! 5 $U~i!;l , i , m% and OUTSIDE~L, M!

5 $U~i!;i [y @l, m#% .

Notice that any hurdle from BETWEEN(L, M) must lie sandwiched between Land M.

LEMMA 17. Let r be a reversal merging hurdles L and M in p. If both sets ofhurdles BETWEEN(L, M) and OUTSIDE(L, M) are nonempty, then r is safe.

PROOF. Let U9 [ BETWEEN(L, M) and U0 [ OUTSIDE(L, M). Lemma16 implies that the reversal r deletes the hurdles L and M from Vp. There is alsoa “danger” that r adds a new hurdle K in pr by transforming K from a nonhurdlein p into a hurdle in pr. If it is the case, K does not separate L and M in p(otherwise, by Lemma 16, K would be deleted from pr). Without loss ofgenerality, we assume that L , U9 , M.

If K is a minimal hurdle in pr, then either L ap K or M ap K (otherwise, Kwould be a hurdle in p). Since K does not separate L and M in p, it implies thatL ap K and M ap K. Since U9 is sandwiched between L and M, it implies thatU9 ap K. Thus, U9 apr K, a contradiction to minimality of K in pr.

If K is the greatest hurdle in pr, then either L, M Îp K or L, M ap K (if,otherwise, L Îp K and M ap K, then, according to Lemma 16, K would bedeleted from pr). If L, M Îp K, then L , U9 ap K , M, that is, K issandwiched between L and M. Therefore, U0 lies outside K in p and U0 Îpr K,a contradiction. If L, M ap K, then, since K is a nonhurdle in p, K separates L,M from another hurdle N. Therefore, K separates U9 from N. Since both N andU9 “survive” in pr, it implies that K separates N and U9 in pr, a contradiction.

Therefore, r deletes the hurdles L and M from Vp and does not add a newhurdle in pr, thus implying that Dh 5 22. Since b(pr) 5 b(p) and c(pr) 5c(p) 2 1, D(b 2 c 1 h) 5 21 and the reversal r is safe. e

LEMMA 18. If h(p) . 3, then there exists a safe reversal merging two hurdles inp.

PROOF. Order h(p) hurdles of p in the increasing order of their rightmostpositions and let L and M be the first and 1 1 (h(p)/ 2)-th hurdles in thisorder. Since h(p) . 3, both sets BETWEEN(L, M) and OUTSIDE(L, M) arenonempty and by Lemma 17, the reversal r merging L and M is safe. e

LEMMA 19. If h(p) 5 2, then there exists a safe reversal merging two hurdles inp. If h(p) 5 1, then there exists a safe reversal cutting the only hurdle in p.

PROOF. If h(p) 5 2, then Vp is either a path graph or contains the greatestcomponent separating two hurdles in p. In both cases, merging the hurdles in pis a safe reversal (Lemma 16). If h(p) 5 1, then Lemma 14 provides a safereversal cutting the only hurdle in p. e

The previous lemmas show that hurdles merging provides a way to find safereversals even in the absence of simple hurdles. On the negative note, hurdlesmerging does not provide a way to transform a superhurdle into a simple hurdle.

21Transforming Cabbage into Turnip

Page 22: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

LEMMA 20. Let r be a reversal in p merging two hurdles L and M. Then everysuperhurdle in p (different from L and M) remains a superhurdle in pr.

PROOF. Let U be a superhurdle in p (different from L and M) protecting anonhurdle U9. Clearly if U9 is a minimal hurdle in 8p\U, then U remains asuperhurdle in pr. If U9 is the greatest hurdle in 8p\U, then U9 does notseparate any hurdles in 8p\U. Therefore, U9 does not belong to PATH(L, M)and hence “survives” in pr (Lemma 16). It implies that U9 remains protected byU in pr. e

7. Fortresses

Lemmas 18 and 19 imply that unless Vp is a homeomorph of the 3-star (a graphwith three edges incident on the same vertex) there exists a safe reversal in p. Onthe other hand, if at least one hurdle in p is simple, then Lemma 15 implies thatthere exists a safe reversal in p. Therefore, the only case in which a safe reversalmight not exist is when Vp is a homeomorph of 3-star with three superhurdles,called a 3-fortress (Figure 11(a)).

LEMMA 21. If r is a reversal destroying a 3-fortress p (i.e. pr is not a 3-fortress),then r is unsafe.

PROOF. Every reversal on a permutation p can reduce h(p) by at most 2 andthe only operation that can reduce the number of hurdles by 2 is merging ofhurdles. On the other hand, Lemma 16 implies that merging of hurdles in a3-fortress can reduce h(p) by at most 1. Therefore, Dh $ 21. Note that, forevery reversal that does not act on edges of the same cycle, D(b 2 c) 5 1 (Db 50, Dc 5 21 if r acts on breakpoints of different cycles, Db 5 1, Dc 5 0 if r actson a breakpoint and an adjacency and Db 5 2, Dc 5 1 if r acts on twoadjacencies) and therefore every reversal that does not act on edges of the samecycle in a 3-fortress is unsafe.

If r acts on a cycle in an unoriented component of a 3-fortress, then it does notreduce the number of hurdles. Since D(b 2 c) 5 0 for a reversal on anunoriented cycle, r is unsafe.

If r acts on a cycle in an oriented component of a 3-fortress, then it does notdestroy any unoriented components in p and, does not reduce the number ofhurdles. If r increases the number of hurdles, then Dh $ 1 and D(b 2 c) $ 21imply that r is unsafe. If the number of hurdles in pr remains the same, thenevery superhurdle in p remains a superhurdle in pr, thus implying that pr is a3-fortress, a contradiction. e

LEMMA 22. If p is a 3-fortress, then d(p) 5 b(p) 2 c(p) 1 h(p) 1 1.

PROOF. Lemma 21 implies that every sorting of 3-fortress contains at leastone unsafe reversal. Therefore, d(p) $ b(p) 2 c(p) 1 h(p) 1 1.

If p has oriented cycles, all oriented components in p can be destroyed by safepaddings (Theorem 2) and safe reversals in oriented components (Theorem 4)without affecting unoriented components.

If p is a 3-fortress without oriented cycles, then an (unsafe) reversal r mergingarbitrary hurdles in p leads to a permutation pr with two hurdles (Lemma 16).Once again, oriented cycles appearing in pr after such merging can be destroyed

22 S. HANNENHALLI AND P. A. PEVZNER

Page 23: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

by safe paddings and safe reversals in oriented components (Theorems 2 and 4)leading to a permutation s with h(s) 5 2. Theorems 2 and 4 and Lemma 19imply that s can be sorted by safe paddings and safe reversals. Hence, thereexists a generalized sorting of p such that all paddings and all reversals but onein this sorting are safe. Therefore, this generalized sorting contains b(p) 2 c(p)1 h(p) 1 1 reversals. Lemma 9 implies that the generalized sorting of p mimicsan optimal (genuine) sorting of p by d(p) 5 b(p) 2 c(p) 1 h(p) 1 1reversals. e

In the following, we try to avoid creating 3-fortresses in the course of sortingby reversals. If we are successful in this task, the permutation p can be sorted inb(p) 2 c(p) 1 h(p) reversals. Otherwise, we show how to sort p in b(p) 2c(p) 1 h(p) 1 1 reversals and prove that such permutations can not be sortedwith fewer number of reversals. Permutation p is called a fortress if it has an oddnumber of hurdles and all these hurdles are superhurdles.

LEMMA 23. If r is a reversal destroying a fortress p with h(p)-superhurdles (i.e.,pr is not a fortress with h(p) superhurdles), then either r is unsafe or pr is a fortresswith h(p) 2 2 superhurdles.

PROOF. Every reversal acting on a permutation can reduce the number ofhurdles by at most 2 and the only operation that can reduce the number ofhurdles by 2 is a merging of hurdles. Arguments similar to the proof of Lemma21 demonstrate that, if r does not merge hurdles, then r is unsafe. If a safereversal r does merge (super)hurdles L and M in p, then Lemma 16 implies thatevery such reversal reduces the number of hurdles by 2, and in the case h(p) .3, does not create new hurdles. Also, Lemma 20 implies that every superhurdlein p but L and M remains a superhurdle in pr, thus implying that pr is a fortresswith h(p) 2 2 superhurdles. e

LEMMA 24. If p is a fortress, then d(p) $ b(p) 2 c(p) 1 h(p) 1 1.

PROOF. Lemma 23 implies that every sorting of p either contains an unsafereversal or gradually decreases the number of superhurdles in p by transforminga fortress with h (super)hurdles into a fortress with h 2 2 (super)hurdles.Therefore, if sorting of p uses only safe reversals, then it will eventually lead to a3-fortress. Therefore, by Lemma 21, every sorting of a fortress contains at leastone unsafe reversal and hence d(p) $ b(p) 2 c(p) 1 h(p) 1 1. e

Finally, we formulate the duality theorem for sorting signed permutations byreversals.

THEOREM 5. For every permutation p,

d~p! 5 H b~p! 2 c~p! 1 h~p! 1 1, if p is a fortressb~p! 2 c~p! 1 h~p! , otherwise.

PROOF. If p has an even number of hurdles then safe paddings (Theorems 2),safe reversals in oriented components (Theorem 4) and safe hurdles mergings(Lemmas 18 and 19) lead to a generalized sorting of p by b(p) 2 c(p) 1 h(p)reversals.

23Transforming Cabbage into Turnip

Page 24: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

If p has an odd number of hurdles at least one of which is simple, then thereexists a safe reversal cutting this simple hurdle (Lemma 15). This safe reversalleads to a permutation with an even number of hurdles. Therefore, similar to theprevious case, there exists a generalized sorting of p using only safe paddings andb(p) 2 c(p) 1 h(p) safe reversals.

Therefore, if p is not a fortress, there exists a generalized sorting of p by b(p)2 c(p) 1 h(p). Lemma 9 implies that this generalized sorting mimics optimal(genuine) sorting of p.

If p is a fortress, there exists a sequence of safe paddings (Theorem 2), safereversals in oriented components (Theorem 4), and safe hurdles merging (Lem-ma 18) leading to a 3-fortress that can be sorted by a series of reversals havingexactly one unsafe reversal. Therefore, there exists a generalized sorting of pusing b(p) 2 c(p) 1 h(p) 1 1 reversals. Lemma 24 implies that thisgeneralized sorting mimics optimal (genuine) sorting of p with d(p) 5 b(p) 2c(p) 1 h(p) 1 1 reversals. e

8. Polynomial Algorithm

Lemmas 9, 18, 15, and 19, and Theorems 2, 4, and 5 motivate the algorithmReversal_Sort, which optimally sorts signed permutations.

Algorithm Reversal_Sort(p)1. while p is not sorted2. if p has a long cycle3. select a safe ( g , b)-padding r of p (Theorem 2)4. else if p has an oriented component5. select a safe reversal r in this component (Theorem 4)6. else if p has an even number of hurdles7. select a safe reversal r merging two hurdles in p (Lemmas 18 and 19)8. else if p has at least one simple hurdle9. select a safe reversal r cutting this hurdle in p (Lemmas 15 and 19)

10. else if p is a fortress with more than three superhurdles11. select a safe reversal r merging two (super)hurdles in p (Lemma 18)12. else /* p is a 3-fortress */13. select an (un)safe reversal r merging two arbitrary (super)hurdles in p14. p 4 p z r15. endwhile16. mimic (genuine) sorting of p using the computed generalized sorting of p (Lemma 9)

THEOREM 6. Reversal_Sort(p) optimally sorts a permutations p 5 (p1p2. . .

pn) in O(n4) time.

PROOF. Theorem 5 implies that Reversal_Sort provides generalized sorting ofp by a series of reversals and paddings containing d(p) reversals. Lemma 9implies that this generalized sorting mimics an optimal (genuine) sorting of p byd(p) reversals.

We sketch an O(n4) implementation of Reversal_Sort(p) (the description ofdata structures is omitted). Note that every iteration of while loop in Reversal_Sort reduces the amount complexity(p) 1 3d(p) by at least 1 thus implying thatthe number of iterations of Reversal_Sort is bounded by 4n. The most “expen-sive” iteration is a search for a safe reversal in an oriented component. Since forsimple permutations it can be implemented in O(n3) time, the overall runningtime of Reversal_Sort is O(n4). e

24 S. HANNENHALLI AND P. A. PEVZNER

Page 25: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

A more careful analysis of Reversal_Sort (omitted here) leads to furtherreduction of running time. Below, we describe a simpler version of Reversal_Sort,which does not use paddings and runs in O(n5) time.

Define

f~p! 5 H 1, if p is a fortress0, otherwise.

A reversal r is valid, if D(b 2 c 1 h 1 f ) 5 21. Proofs of Theorem 1 andLemma 24 imply that there is a reversal with D(b 2 c 1 h 1 f ) $ 21. Thisobservation and Theorem 5 imply the following:

THEOREM 7. For every permutation p there exists a valid reversal in p. Everysequence of valid reversals sorting p is optimal.

Theorem 7 motivates the following simple version of Reversal_Sort, which isvery fast in practice:

Algorithm Reversal_Sort_Simple(p)1. while p is not sorted2. select a valid reversal r in p (Theorem 7)3. p 4 p z r4. endwhile

Step 1 can be executed at most n times for a permutation of size n. There aren p (n 2 1) reversals to search among. Computing (b 2 c 1 h 1 f ) for anypermutation can be done in O(n2) time. Hence, the worst-case time complexityof the above algorithm is O(n5).

Reversal_Sort_Simple was implemented and tested on biological data. Theresults of these tests are described in Hannenhalli and Pevzner [1996], inparticular, we found optimal evolutionary scenarios for extensively rearrangedgenomes, which were considered as too hard to analyze in the previous studies.Experiments with Reversal_Sort_Simple on simulated data explained the mysteryof astonishing performance of previously suggested approximation algorithms forsorting signed permutations by reversals. A simple explanation for this perfor-mance is that the bound (1) is extremely tight since h(p) is small for “random”permutations and zero for most of the biological data.

ACKNOWLEDGMENTS. We are indebted to Vineet Bafna, Piotr Berman, WebbMiller, and Anatoly Rubinov for many helpful discussions and suggestions. Weare also grateful to Eric Boudreau for sending us unpublished experimental dataon gene orders of Chlamydomonas gelatinosa and Chlamydomonas reinhardtii fortesting Reversal_Sort_Simple. Both referees provided many valuable comments.

REFERENCES

AIGNER, M., AND WEST, D. B. 1987. Sorting by insertion of leading element. J. Combin. Theory 45,306 –309.

BAFNA, V., AND PEVZNER, P. 1995. Sorting by reversals: Genome rearrangements in plant or-ganelles and evolutionary history of X chromosome. Mol. Biol. Evol. 12, 239 –246.

BAFNA, V., AND PEVZNER, P. 1996. Genome rearrangements and sorting by reversals. SIAMJ. Comput. 25, 272–289.

BAFNA, V., AND PEVZNER, P. 1998. Sorting by transpositions. SIAM J. Disc. Math. 11, 224 –240.

25Transforming Cabbage into Turnip

Page 26: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

BERMAN, P., AND HANNENHALLI, S. 1996. Fast sorting by reversals. In Combinatorial PatternMatching, Proceedings of the 6th Annual Symposium (CPM’96). Lecture Notes in Computer Science.Springer-Verlag, Berlin, Germany, pp. 168 –185.

CAPRARA, A. 1997. Sorting by reversals is difficult. In Proceedings of the 1st Annual InternationalConference on Computational Molecular Biology (RECOMB97). pp. 75– 83.

COHEN, D., AND BLUM, M. 1993. Improved bounds for sorting pancakes under a conjecture,Manuscript.

EVEN, S., AND GOLDREICH, O. 1981. The minimum-length generator sequence problem is NP-hard. J. Algorithms 2, 311–313.

GATES, W. H., AND PAPADIMITRIOU, C. H. 1979. Bounds for sorting by prefix reversals. Disc. Math.27, 47–57.

HANNENHALLI, S. 1995. Polynomial algorithm for computing translocation distance between ge-nomes. In Combinatorial Pattern Matching, Proceedings of the 6th Annual Symposium (CPM’95).Lecture Notes in Computer Science. Springer-Verlag, Berlin, Germany, pp. 162–176.

HANNENHALLI, S., CHAPPEY, C., KOONIN, E., AND PEVZNER, P. 1995. Genome sequence compari-son and scenarios for gene rearrangements: A test case. Genomics, 30, 299 –311.

HANNENHALLI, S., AND PEVZNER, P. 1995. Transforming men into mice (polynomial algorithm forgenomic distance problem). In Proceedings of the 36th Annual IEEE Symposium on Foundations ofComputer Science. IEEE Computer Society Press, Los Alamitos, Calif., pp. 581–592.

HANNENHALLI, S., AND PEVZNER, P. 1996. To cut . . . or not to cut (applications of comparativephysical maps in molecular evolution). In Proceedings of the 7th Annual ACM–SIAM Symposium onDiscrete Algorithms (Atlanta, Ga., Jan. 28 –30). ACM, New York, pp. 304 –313.

HEYDARI, M., AND SUDBOROUGH, I. H. 1993. On sorting by prefix reversals and the diameter ofpancake networks, Manuscript.

JERRUM, M. 1985. The complexity of finding minimum-length generator sequences. Theoret.Comput. Sci. 36, 265–289.

KECECIOGLU, J., AND GUSFIELD, D. 1994. Reconstructing a history of recombinations from a set ofsequences. In Proceedings of the 5th Annual ACM–SIAM Symposium on Discrete Algorithms, ACM,New York, pp. 471– 480.

KECECIOGLU, J., AND RAVI, R. 1995. Of mice and men: Algorithms for evolutionary distancesbetween genomes with translocation. In Proceedings of the 6th Annual ACM–SIAM Symposium onDiscrete Algorithms (San Francisco, Calif., Jan. 22–24). ACM, New York, pp. 604 – 613.

KECECIOGLU, J., AND SANKOFF, D. 1995. Exact and approximation algorithms for the inversiondistance between two permutations. Algorithmica 13, 180 –210.

KECECIOGLU, J., AND SANKOFF, D. 1994. Efficient bounds for oriented chromosome inversiondistance. In Combinatorial Pattern Matching, Proceedings of the 5th Annual Symposium (CPM’94).Lecture Notes in Computer Science, vol. 807. Springer-Verlag, Berlin, Germany, pp. 307–325.

MAKAROFF, C. A., AND PALMER, J. D. 1988. Mitochondrial DNA rearrangements and transcrip-tional alterations in the male sterile cytoplasm of Ogura radish. Mol. Cell. Biol. 8, 1474 –1480.

NADEAU, J. H., AND TAYLOR, B. A. 1984. Lengths of chromosomal segments conserved sincedivergence of man and mouse. Proc. Natl. Acad. Sci. USA 81, 814 – 818.

PALMER, J. D., AND HERBON, L. A. 1988. Plant mitochondrial DNA evolves rapidly in structure,but slowly in sequence. J. Mol. Evolut. 27, 87–97.

PEVZNER, P. A., AND WATERMAN, M. S. 1995. Open combinatorial problems in computationalmolecular biology. In Proceedings of the 3rd Israel Symposium on Theory of Computing and Systems.IEEE Computer Society Press, Los Alamitos, Calif., pp. 158 –163.

SANKOFF, D. 1992. Edit distance for genome comparison based on non-local operations. InCombinatorial Pattern Matching, Proceedings of the 3rd Annual Symposium (CPM’92). Lecture Notesin Computer Science, vol. 644. Springer-Verlag, Berlin, Germany, pp. 121–135.

SANKOFF, D., CEDERGREN, R., AND ABEL, Y. 1990. Genomic divergence through gene rearrange-ment. In Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Chap. 26.Academic Press, Orlando, Fla., pp. 428 – 438.

SANKOFF, D., LEDUC, G., ANTOINE, N., PAQUIN, B., LANG, B. F., AND CEDERGREN, R. 1992. Geneorder comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl.Acad. Sci. USA 89, 6575– 6579.

STURTEVANT, A. H., AND DOBZHANSKY, T. 1936. Inversions in the third chromosome of wild racesof drosophila pseudoobscura, and their use in the study of the history of the species. Proc. Nat.Acad. Sci. 22, 448 – 450.

26 S. HANNENHALLI AND P. A. PEVZNER

Page 27: Transforming Cabbage into Turnip: Polynomial Algorithm for ...webdocs.cs.ualberta.ca/~ghlin/CMPUT606/Papers/sbrHP99.pdf · convincingly proved that genome rearrangements is a common

TARJAN, R., KAPLAN, H., AND SHAMIR, R. 1997. Faster and simpler algorithm for sorting byreversals. In Proceedings of the 8th Annual ACM–SIAM Symposium on Discrete Algorithms. ACM,New York, pp. 614 – 623.

WATTERSON, G. A., EWENS, W. J., HALL, T. E., AND MORGAN, A. 1982. The chromosome inversionproblem. J. Theoret. Biol. 99, 1–7.

RECEIVED FEBRUARY 1995; REVISED JULY 1998; ACCEPTED SEPTEMBER 1998

Journal of the ACM, Vol. 46, No. 2, March 1999.

27Transforming Cabbage into Turnip