IEEE Transactions on Computational Biology and Bioinformatics (January-March), Volume 2, Number 1 (2005)

Guest Editorial: WABI Special Section Part II

Junhyong Kim and Inge Jonassen

The Fourth International Workshop on Algorithms in BioInformatics (WABI) 2004 was held in Bergen, Norway, September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the proceedings from the workshop (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman).

The WABI 2004 program committee selected a small number of papers among the 39 to be invited to submit extended versions of their papers to a special section of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. Four papers were published in the October-December 2004 issue of the journal and this issue contains an additional three papers. We would like to thank both the entire program committee for WABI and the reviewers of the papers in this issue for their valuable contributions.

The first of the papers is "A New Distance for High Level RNA Secondary Structure Comparison" authored by Julien Allali and Marie-France Sagot. This paper describes algorithms for comparing secondary structures of RNA molecules where the structures are represented by trees. The problem of classifying RNA secondary structure is becoming critical as biologists are discovering more and more noncoding functional elements in the genome (e.g., miRNA). Most likely, the major functional determinants of the elements are their secondary structure and, therefore, a metric between such secondary structures will also help delineate clusters of functional groups. In Allali and Sagot's paper, two tree representations of secondary structure are compared by analysing how one tree can be transformed into the other using an allowed set of operations. Each operation can be associated with a cost and the distance between two trees can then be defined as the minimum cost associated with a transform of one tree to the other. Allali and Sagot introduce two new operations that they name edge fusion and node fusion and show that these alleviate limitations associated with the classical tree edit operations used for RNA comparison. Importantly, they also present algorithms for calculating the distance between trees allowing the new operations in addition to the classical ones, and analyze the performance of the algorithms.

The second paper is "Topological Rearrangements and Local Search Method for Tandem Duplication Trees" and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Tandem repeats arise through evolution by copying (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference with the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree form. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.

The third paper is "Optimizing Multiple Seeds for Homology Search" authored by Daniel G. Brown. The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets, and heuristic methods for restricting the alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.

Junhyong Kim
Inge Jonassen

Guest Editors

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 1

J. Kim is with the Department of Biology, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104. E-mail: [email protected].

I. Jonassen is with the Department of Informatics and Computational Biology Unit, University of Bergen, HIB N5020 Bergen, Norway. E-mail: [email protected].

For information on obtaining reprints of this article, please send e-mail to:

[email protected]/05/$20.00 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM



Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the Cyber Infrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.

Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.


A New Distance for High Level RNA Secondary Structure Comparison

Julien Allali and Marie-France Sagot

Abstract: We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this term depends only on the number of successive fusions that may be applied to the same node, not on the total number of fusions. The algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.

Index Terms: Tree comparison, edit operation, distance, RNA, secondary structure.

1 INTRODUCTION

RNAs are one of the fundamental elements of a cell. Their role in regulation has recently been shown to be far more prominent than initially believed (see the 20 December 2002 issue of Science, which designated small RNAs with regulatory function as the scientific breakthrough of the year). It is now known, for instance, that there is massive transcription of noncoding RNAs. Yet current mathematical and computer tools remain mostly inadequate to identify, analyze, and compare RNAs.

An RNA may be seen as a string over the alphabet of nucleotides (also called bases), {A, C, G, U}. Inside a cell, RNAs do not retain a linear form, but instead fold in space. The fold is given by the set of nucleotide bases that pair. The main type of pairing, called canonical, corresponds to bonds of the type A-U and G-C. Other rarer types of bonds may be observed, the most frequent among them being G-U, also called the wobble pair. Fig. 1 shows the sequence of a folded RNA. Each box represents a consecutive sequence of bonded pairs, corresponding to a helix in 3D space. The secondary structure of an RNA is the set of helices (or the list of paired bases) making up the RNA. Pseudoknots, which may be described as a pair of interleaved helices, are in general excluded from the secondary structure of an RNA. RNA secondary structures can thus be represented as planar graphs. An RNA primary structure is its sequence of nucleotides while its tertiary structure corresponds to the geometric form the RNA adopts in space.
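To make the notions of paired bases and helices concrete, here is a minimal Python sketch. It assumes the secondary structure is given in dot-bracket notation (an assumption of this example, not a format used in the paper), and the toy sequence and all function names are hypothetical.

```python
# Canonical and wobble pairs as named in the text.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(x, y):
    """Classify the bond between bases x and y."""
    if (x, y) in CANONICAL:
        return "canonical"
    if (x, y) in WOBBLE:
        return "wobble"
    return "other"


def paired_positions(structure):
    """List the paired positions (i, j), i < j, of a dot-bracket string.

    Assumption of this sketch: '(' and ')' mark the two ends of a pair and
    '.' an unpaired base, so pseudoknots cannot be expressed.
    """
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def helices(pairs):
    """Group stacked pairs (i, j), (i + 1, j - 1), ... into helices."""
    result = []
    for i, j in pairs:  # pairs must be sorted by opening position
        if result and result[-1][-1] == (i - 1, j + 1):
            result[-1].append((i, j))
        else:
            result.append([(i, j)])
    return result


sequence = "GGGAAAUCC"    # hypothetical toy hairpin
structure = "(((...)))"
pairs = paired_positions(structure)
print([pair_type(sequence[i], sequence[j]) for i, j in pairs])
print(len(helices(pairs)))
```

On this toy hairpin the three pairs form a single helix, the innermost pair being a G-U wobble.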

Apart from helices, the other main structural elements in an RNA are:

1. hairpin loops which are sequences of unpaired bases closing a helix;

2. internal loops which are sequences of unpaired bases linking two different helices;

3. bulges which are internal loops with unpaired bases on one side only of a helix;

4. multiloops which are unpaired bases linking at least three helices.

Stems are successions of one or more among helices, internal loops, and/or bulges.

The comparison of RNA secondary structures is one of the main basic computational problems raised by the study of RNAs. It is the problem we address in this paper. The motivations are many. RNA structure comparison has been used in at least one approach to RNA structure prediction that takes as initial data a set of unaligned sequences supposed to have a common structural core [1]. For each sequence, a set of structural predictions are made (for instance, all suboptimal structures predicted by an algorithm like Zuker's MFOLD [15], or all suboptimal sets of compatible helices or stems). The common structure is then found by comparing all the structures obtained from the initial set of sequences, and identifying a substructure common to all, or to some of the sequences. RNA structure comparison is also an essential element in the discovery of RNA structural motifs, or profiles, or of more general models that may then be used to search for other RNAs of the same type in newly sequenced genomes. For instance, general models for tRNAs and introns of group I have been derived by hand [3], [10]. It is an open question whether models at least as accurate as these, or perhaps even more accurate, could have been derived in an automatic way. The identification of smaller structural motifs is an equally important topic that requires comparing structures.

As we saw, the comparison of RNA structures may concern known RNA structures (that is, structures that were experimentally determined) or predicted structures. The objective in both cases is the same: to find the common parts of such structures.

J. Allali is with the Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454, Marne-la-Vallée Cedex 2, France. E-mail: [email protected].

M.-F. Sagot is with Inria Rhône-Alpes, Université Claude Bernard, Lyon I, 43 Bd du 11 Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: [email protected].

Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0164-1004.

1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

In [11], Shapiro suggested to mathematically model RNA secondary structures without pseudoknots by means of trees. The trees are rooted and ordered, which means that the order among the children of a node matters. This order corresponds to the 5'-3' orientation of an RNA sequence.

Given two trees representing each an RNA, there are two main ways for comparing them. One is based on the computation of the edit distance between the two trees while the other consists in aligning the trees and using the score of the alignment as a measure of the distance between the trees. Contrary to what happens with sequences, the two, alignment and edit distance, are not equivalent. The alignment distance is a restrained form of the edit distance between two trees, where all insertions must be performed before any deletions. The alignment distance for general trees was defined in 1994 by Jiang et al. in [9] and extended to an alignment distance between forests in [6]. More recently, Höchsmann et al. [7] applied the tree alignment distance to the comparison of two RNA secondary structures. Because of the restriction on the way edit operations can be applied in an alignment, we are not concerned in this paper with tree alignment distance and we therefore address exclusively from now on the problem of tree edit distance.

Our way for comparing two RNA secondary structures is then to apply a number of tree edit operations in one or both of the trees representing the RNAs until isomorphic trees are obtained. The currently most popular program using this approach is probably the Vienna package [5], [4]. The tree edit operations considered are derived from the operations classically applied to sequences [13]: substitution, deletion, and insertion. In 1989, Zhang and Shasha [14] gave a dynamic programming algorithm for comparing two trees. Shapiro and Zhang then showed [12] how to use tree editing to compare RNAs. The latter also proposed various tree models that could be used for representing RNA secondary structures. Each suggested tree offers a more or less detailed view of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few examples of such possible views for the RNA given in Fig. 2a.

In Fig. 2, the nodes of the tree in Fig. 2b represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with, respectively, a base or a pair of bases. A node of the tree in Fig. 2c represents a set of successive unpaired bases or of stacked paired ones. The label of a node is an integer indicating, respectively, the number of unpaired bases or the height of the stack of paired ones. The nodes of the tree in Fig. 2d represent elements of secondary structure: hairpin loop (H), bulge (B), internal loop (I), or multiloop (M). The edges correspond to helices. Finally, the tree in Fig. 2e contains only the information concerning the skeleton of multiloops of an RNA. The last representation, though giving a highly simplified view of an RNA, is important nevertheless as it is generally accepted that it is this skeleton which is usually the most constrained part of an RNA. The last two models may be enriched with information concerning, for instance, the number of (unpaired) bases in a loop (hairpin, internal, multi) or bulge, and the number of paired bases in a helix. The first labels the nodes of the tree, the second its edges. Other types of information may be added (such as overall composition of the elements of secondary structure). In fact, one could consider working with various representations simultaneously or in an interlocked, multilevel fashion. This goes beyond the scope of this paper which is concerned with comparing RNA secondary structures using any one among the many tree representations possible. We shall, however, comment further on this multilevel approach later on.
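As an illustration of the most detailed view (the one of Fig. 2b), the following Python sketch builds a tree whose internal nodes are base pairs and whose leaves are unpaired bases. The dot-bracket input format, the artificial root node, and the names `Node`, `base_level_tree`, and `show` are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def base_level_tree(sequence, structure):
    """Build a Fig. 2b-style tree: internal nodes are base pairs, leaves are
    unpaired bases, and children are ordered 5' to 3'."""
    root = Node("root")          # artificial root, an assumption of this sketch
    stack = [root]
    for base, ch in zip(sequence, structure):
        if ch == ".":
            stack[-1].children.append(Node(base))
        elif ch == "(":
            node = Node(base)    # label is completed when the pair closes
            stack[-1].children.append(node)
            stack.append(node)
        elif ch == ")":
            stack.pop().label += "-" + base
    return root


def show(node):
    """Render a tree as nested parentheses, e.g. 'G-C(A A)'."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(show(c) for c in node.children) + ")"


print(show(base_level_tree("GCAAAGC", "((...))")))
```

The coarser views of Figs. 2c to 2e could be derived from such a tree by merging runs of stacked pairs or of unpaired bases, which is not attempted here.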

Concerning the objectives of this paper, they are twofold. The first is to give some indications on why the classical edit operations that have been considered so far in the literature for comparing trees present some limitations when the trees stand for RNA structures. Three cases of such limitations will be illustrated through examples in Section 3. In Section 4, we then introduce two novel operations, so-called node-fusion and edge-fusion, that enable us to address some of these limitations and then give a dynamic programming algorithm for comparing two RNA structures with these two additional operations. Implementation issues and initial results are presented in Section 4. In Section 5, we give a first application of our algorithm to the comparison of two RNA secondary structures. Finally, in Section 6, we sketch the main ideas behind the multilevel RNA comparison approach mentioned above.

Before that, we start by introducing some notation and by recalling in the next section the basics about classical tree edit operations and tree mapping.

This paper is an extended version of a paper presented at the Workshop on Algorithms in BioInformatics (WABI) in 2004, in Bergen, Norway. A few more examples are given to illustrate some of the points made in the WABI paper, complexity and implementation issues are discussed in more depth as are the cost functions and a multilevel approach to comparing RNAs.

Fig. 1. Primary and secondary structures of a transfer RNA.

Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of the same RNA (a).

2 TREE EDITING AND MAPPING

Let T be an ordered rooted tree, that is, a tree where the order among the children of a node matters. We define three kinds of operations on T: deletion, insertion, and relabeling (corresponding to a substitution in sequence comparison). The operations are shown in Fig. 3. The deletion (Fig. 3b) of a node u removes u from the tree. The children of u become the children of u's father. An insertion (Fig. 3c) is the symmetric of a deletion. Given a node u, we remove a consecutive (in relation to the order among the children) set $u_1, \ldots, u_p$ of its children, create a new node v, make v a child of u by attaching it at the place where the set was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order) the children of v. The relabeling of a node (Fig. 3d) consists simply in changing its label.
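The three classical operations can be sketched in a few lines of Python. This is only an illustration under simple assumptions: trees are stored as nested `Node` objects, a node is addressed through its father and its position among the father's children, and deleting the root is not supported. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def delete(parent, index):
    """Delete the child u = parent.children[index]; the children of u take
    its place, in order, among the children of u's father."""
    u = parent.children[index]
    parent.children[index:index + 1] = u.children
    return u


def insert(parent, start, stop, label):
    """Insert a new node v under parent; the consecutive children
    parent.children[start:stop] become the children of v."""
    v = Node(label, parent.children[start:stop])
    parent.children[start:stop] = [v]
    return v


def relabel(node, label):
    """Change the label of a node."""
    node.label = label


def show(node):
    """Render a tree as nested parentheses."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(show(c) for c in node.children) + ")"


tree = Node("A", [Node("B", [Node("C"), Node("D")]), Node("E")])
delete(tree, 0)              # A(C D E)
insert(tree, 1, 3, "I")      # A(C I(D E))
relabel(tree, "K")           # K(C I(D E))
print(show(tree))
```

Note that an insertion undoes a deletion when it regroups the same consecutive children under a node carrying the deleted label, which is the symmetry mentioned above.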

Given two trees $T$ and $T'$, we define $S = \{s_1 \ldots s_e\}$ to be a series of edit operations such that, if we apply successively the operations in $S$ to the tree $T$, we obtain $T'$ (i.e., $T$ and $T'$ become isomorphic). A series of operations like $S$ realizes the editing of $T$ into $T'$ and is denoted by $T \xrightarrow{S} T'$.

We define a function cost from the set of possible edit operations (deletion, insertion, relabeling) to the integers (or the reals) such that $cost(s)$ is the score of the edit operation $s$. If $S$ is a series of edit operations, we define by extension that $cost(S)$ is $\sum_{s \in S} cost(s)$. We can define the edit distance between two trees as the cost of a series of operations that performs the editing of $T$ into $T'$ and whose cost is minimal:

$$distance(T, T') = \min \{ cost(S) \mid T \xrightarrow{S} T' \}.$$

Let an insertion or a deletion cost one and the relabeling of a node cost zero if the label is the same and one otherwise. For the two trees of the figure on the left, the series relabel(A → F).delete(B).insert(G) realizes the editing of the left tree into the right one and costs 3. Another possibility is the series delete(B).relabel(A → G).insert(F) which also costs 3. The distance between these two trees is 3.

Given a series of operations $S$, let us consider the nodes of $T$ that are not deleted (in the initial tree or after some relabeling). Such nodes are associated with nodes of $T'$. The mapping $M_S$ relative to $S$ is the set of couples $(u, u')$ with $u \in T$ and $u' \in T'$ such that $u$ is associated with $u'$ by $S$.

The operations described above are the classical tree edit operations that have been commonly used in the literature for RNA secondary structure comparison. We now present a few results obtained using such classical operations that will allow us to illustrate a few limitations they may present when used for comparing RNA structures.

3 LIMITATIONS OF CLASSICAL TREE EDIT OPERATIONS FOR RNA COMPARISON

As suggested in [12], the tree edit operations recalled in the previous section can be used on any type of tree coding of an RNA secondary structure.

Fig. 4 shows two RNase Ps extracted from the database [2] (they are found, respectively, in Streptococcus gordonii and Thermotoga maritima). For the example we discuss now, we code the RNAs using the tree representation indicated in Fig. 2b where a node represents a base pair and a leaf an unpaired base. After applying a few edit operations to the trees, we obtain the result indicated in Fig. 4, with deleted/inserted bases in gray. We have surrounded a few regions that match in the two trees. Bases in the rectangular box at the bottom of the RNA on the left are thus associated with bases in the bottom rightmost rectangular box of the RNA on the right. The same is observed for the bases in the oval boxes for both RNAs. Such matches illustrate one of the main problems with the classical tree edit operations: Bases in one RNA may be mapped to identically labeled bases in the other RNA to minimise the total cost, while such bases should not be associated in terms of the elements of secondary structure to which they belong. In fact, such elements are often distant from one another along the common RNA structure. We call this problem the "scattering effect." It is related to the definition of tree edit operations. In the case of this example and of the representation adopted, the problem might have been avoided if structural information had been used. Indeed, the problem appears also because the structural location of an unpaired base is not taken into account. It is therefore possible to match, for instance, an unpaired base from a hairpin loop with an unpaired base from a multiloop. Using another type of representation, as we shall do, would, however, not be enough to solve all problems as we see next.

Fig. 3. Edit operations: (a) the original tree T, (b) deletion of the node labeled D, (c) insertion of the node labeled I, and (d) relabeling of a node in T (the label A of the root is changed into K).

Indeed, to compare the same two RNAs, we can also use a more abstract tree representation such as the one given in Fig. 2d. In this case, the internal nodes represent a multiloop, internal-loop, or bulge, the leaves code for hairpin loops and edges for helices. The result of the editing of T into T' for some cost function is presented in Fig. 5 (we shall come back later to the cost functions used in the case of such more abstract RNA representations; for the sake of this example, we may assume an arbitrary one is used).

The problem we wish to illustrate in this case is shown by the boxes in the figure. Consider the boxes at the bottom. In the left RNA, we have a helix made up of 13 base pairs. In the right RNA, the helix is formed by seven base pairs followed by an internal loop and another helix of size 5. By definition (see Section 2), the algorithm can only associate one element in the first tree to one element in the second tree. In this case, we would like to associate the helix of the left tree to the two helices of the second tree since it seems clear that the internal loop represents either an inserted element in the second RNA, or the unbonding of one base pair. This, however, is not possible with classical edit operations.

A third type of problem one can meet when using only the three classical edit operations to compare trees standing for RNAs is similar to the previous one, but concerns this time a node instead of edges in the same tree representation. Often, an RNA may present a very small helix between two elements (multiloop, internal-loop, bulge, or hairpin-loop) while such helix is absent in the other RNA. In this case, we would therefore have liked to be able to associate one node in a tree representing an RNA with two or more nodes in the tree for the other RNA. Once again, this is not possible with any of the classical tree edit operations. An illustration of this problem is shown in Fig. 6.

Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNase Ps, of Streptococcus gordonii and of Thermotoga maritima, using the model given in Fig. 2b.

Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNase Ps, of Saccharomyces uvarum and of Saccharomyces kluyveri, using the model given in Fig. 2d.

We shall use RNA representations that take the elements of the structure of an RNA into account to avoid some of the scattering effect. Furthermore, in addition to considering information of a structural nature, labels are attached, in general, to both nodes and edges of the tree representing an RNA. Such labels are numerical values (integers or reals). They represent in most cases the size of the corresponding element, but may also further indicate its composition, etc. Such additional information is then incorporated into the cost functions for all three edit operations. It is important to observe that when dealing with trees labeled at both the nodes and edges, any node and the edge that leads to it (or, in an alternative perspective, departs from it) represent a single object from the point of view of computing an edit distance between the trees.

It remains now to deal with the last two problems that are a consequence of the one-to-one associations between nodes and edges enforced by the classical tree edit operations. To that purpose, we introduce two novel tree edit operations, called the edge fusion and the node fusion.

4 INTRODUCING NOVEL TREE EDIT OPERATIONS

4.1 Edge Fusion and Node Fusion

In order to address some of the limitations of the classical tree edit operations that were illustrated in the previous section, we need to introduce two novel operations. These are the edge fusion and the node fusion. They may be applied to any of the tree representations given in Figs. 2c, 2d, and 2e.

An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$ and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$ and $e_{c_i}$. For instance, if such labels indicated the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max or the average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$ and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.

An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$ and of the edge between them.

Observe that a node fusion may be simulated using the classical edit operations by a deletion followed by a relabeling. However, the difference between a node fusion and a deletion/relabeling is in the cost associated with both operations. We shall come back to this point later.

Like insertions and deletions, edge fusions and node fusions have symmetric counterparts, which are the edge split and the node split.

Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNase Ps from Pyrococcus furiosus and Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.

Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.

Given two rooted, ordered, and labeled trees $T$ and $T'$, we define the edit distance with fusion between $T$ and $T'$ as $distance_{fusion}(T, T') = \min \{ cost(S) \mid T \xrightarrow{S} T' \}$ with $cost(s)$ the cost associated to each of the seven edit operations now considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).
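The two fusion operations of this section can be sketched as follows in Python. The sketch assumes that each node stores its own numerical label (`size`) together with the label of the edge leading to it (`edge`), and that fused labels are obtained by summing sizes, which is only one of the label functions the text allows. The names and the numbers in the example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    size: float                   # numerical node label (e.g. loop size)
    edge: float = 0.0             # label of the edge leading to this node
    children: list = field(default_factory=list)


def node_fusion(u, i):
    """Fuse u with its i-th child c: the children of c become children of u,
    and u is relabeled from the labels of u, c and the edge between them."""
    c = u.children[i]
    u.size = u.size + c.edge + c.size     # one possible label function: a sum
    u.children[i:i + 1] = c.children


def edge_fusion(father, k, i):
    """Fuse the edge leading to u = father.children[k] with the edge from u
    to its i-th child c. Returns the sibling subtrees that get deleted."""
    u = father.children[k]
    c = u.children[i]
    deleted = u.children[:i] + u.children[i + 1:]
    c.edge = u.edge + u.size + c.edge     # new edge links the father of u to c
    father.children[k] = c
    return deleted


# A helix of 7 pairs, an internal loop (size 1 assumed here), a helix of 5 pairs.
root = Node(3, children=[Node(1, edge=7, children=[Node(4, edge=5)])])
deleted = edge_fusion(root, 0, 0)
print(root.children[0].edge, len(deleted))
```

With these assumed sizes the fused edge has label 13, which is how a single helix of 13 base pairs in one RNA could be associated with two shorter helices separated by an internal loop in the other. A cost function would still have to charge for the subtrees returned in `deleted`.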

Proposition 1. If the following is verified:

. $cost_{match}(a, b)$ is a distance,

. $cost_{ins}(a) = cost_{del}(a) \geq 0$,

. $cost_{node\,fusion}(a, b, c) = cost_{node\,split}(a, b, c) \geq 0$, and

. $cost_{edge\,fusion}(a, b, c) = cost_{edge\,split}(a, b, c) \geq 0$,

then $distance_{fusion}$ is indeed a distance.

Proof. The positiveness of $distance_{fusion}$ is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that $distance_{fusion}$ satisfies the triangular inequality. □

Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.

We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.

4.2 Algorithm

The method we introduce is a dynamic programming

algorithm based on the one proposed by Zhang and Shasha.Their algorithm is divided in two parts: They first compute

the edit distance between two trees (this part is denoted by

TDist) and then the distance between two forests (this partis denoted by FDist). Fig. 8 illustrates in pictorial form the

part TDist and Fig. 9 the FDist part of the computation.In order to take our two new operations into account, we

need to compute a few more things in the TDist part.Indeed, we must add the possibility for each tree to have a

node fusion (inversely, node split) between the root and one

of its children, or to have an edge fusion (inversely edge

split) between the root and one of its children. These

additional operations are indicated in the right box of Fig. 8.We present now a formal description of the algorithm. Let

T be an ordered rooted tree with jTj nodes. We denote by ti

the ith node in a postfix order. For each node ti, li is the

index of the leftmost child of the subtree rooted at ti. Let

Ti . . .j denote the forest composed by the nodes ti . . . tj(T T0 . . . jTj. To simplify notation, from now on, when

there is no ambiguity, i will refer to the node ti. In this case,

distancei1 . . . i2; j1 . . .j2 will be equivalent to distanceTi1. . . i2; T

0j1 . . .j2.The algorithm of Zhang and Sasha is fully described by

the following recurrence formula:

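If $i_1 = l(i_2)$ and $j_1 = l(j_2)$:

$$
distance(i_1 \ldots i_2, j_1 \ldots j_2) = \min
\begin{cases}
distance(i_1 \ldots i_2 - 1, j_1 \ldots j_2) + cost_{del}(i_2) \\
distance(i_1 \ldots i_2, j_1 \ldots j_2 - 1) + cost_{ins}(j_2) \\
distance(i_1 \ldots i_2 - 1, j_1 \ldots j_2 - 1) + cost_{match}(i_2, j_2)
\end{cases}
\tag{1}
$$

else:

$$
distance(i_1 \ldots i_2, j_1 \ldots j_2) = \min
\begin{cases}
distance(i_1 \ldots i_2 - 1, j_1 \ldots j_2) + cost_{del}(i_2) \\
distance(i_1 \ldots i_2, j_1 \ldots j_2 - 1) + cost_{ins}(j_2) \\
distance(i_1 \ldots l(i_2) - 1, j_1 \ldots l(j_2) - 1) + distance(l(i_2) \ldots i_2, l(j_2) \ldots j_2)
\end{cases}
\tag{2}
$$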

Part (1) of the formula corresponds to Fig. 8, while part (2) corresponds to Fig. 9. In practice, the algorithm stores in a matrix the score between each subtree of $T$ and $T'$. The space complexity is therefore $O(|T| \times |T'|)$. To reach this complexity, the computation must be done in a certain order (see Section 4.3). The time complexity of the algorithm is

$$O(|T| \times \min(leaf(T), height(T)) \times |T'| \times \min(leaf(T'), height(T'))),$$

where $leaf(T)$ and $height(T)$ represent, respectively, the number of leaves and the height of a tree $T$.


Fig. 8. Zhang and Shasha's dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to take fusion into account.



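The formula to compute the edit score allowing for both node and edge fusions follows.

If $i_1 \geq l(i_k)$ and $j_1 \geq l(j_{k'})$:

$$
distance(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'}\}, path') = \min
\begin{cases}
distance(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + cost_{del}(i_k) \\
distance(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + cost_{ins}(j_{k'}) \\
distance(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + cost_{match}(i_k, j_{k'}) \\[4pt]
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\
\quad distance(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, path.(u, i_c), \{j_1 \ldots j_{k'}\}, path') \\
\quad\quad + \; cost_{node\_fusion}(i_c, i_k) \qquad \text{(obs.: } i_k \text{ data are changed)} \\
\quad distance(\{i_l \ldots i_{c-1}, i_k\}, path.(e, i_c), \{j_1 \ldots j_{k'}\}, path') \\
\quad\quad + \; cost_{edge\_fusion}(i_c, i_k) + distance(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
\quad\quad + \; distance(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \qquad \text{(obs.: } i_k \text{ data are changed)} \\[4pt]
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\
\quad distance(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, path'.(u, j_{c'})) \\
\quad\quad + \; cost_{node\_split}(j_{c'}, j_{k'}) \qquad \text{(obs.: } j_{k'} \text{ data are changed)} \\
\quad distance(\{i_1 \ldots i_k\}, path, \{j_{l'} \ldots j_{c'}, j_{k'}\}, path'.(e, j_{c'})) \\
\quad\quad + \; cost_{edge\_split}(j_{c'}, j_{k'}) + distance(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
\quad\quad + \; distance(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \qquad \text{(obs.: } j_{k'} \text{ data are changed)}
\end{cases}
\tag{3}
$$

else, set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$:

$$
distance(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'}\}, path') = \min
\begin{cases}
distance(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + cost_{del}(i_k) \\
distance(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + cost_{ins}(j_{k'}) \\
distance(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + distance(\{i_l \ldots i_k\}, path, \{j_{l'} \ldots j_{k'}\}, path')
\end{cases}
\tag{4}
$$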

Given two nodes $u$ and $v$ such that $v$ is a child of $u$, $node\_fusion(u, v)$ is the fusion of node $v$ with $u$, and $edge\_fusion(u, v)$ is the edge fusion between the edges leading to, respectively, nodes $u$ and $v$. The symmetric operations are denoted by, respectively, $node\_split(u, v)$ and $edge\_split(u, v)$.

The distance computation takes two new parameters, $path$ and $path'$. These are sets of pairs $(e \text{ or } u, v)$ which indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions that were done. Thus, a pair $(e, v)$ indicates that an edge fusion has been performed between $i_k$ and $v$, while for $(u, v)$ a node $v$ has been merged with node $i_k$.

The notation $path.(e, v)$ indicates that the operation $(e, v)$ has been performed in relation to node $i_k$ and the information is thus concatenated to the set $path$ of pairs currently linked with $i_k$.
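One possible way to represent the $path$ parameter is sketched below in Python. This is only an illustration under stated assumptions: the names `Fusion`, `Path`, `extend_path`, and `max_fusions` are hypothetical and do not come from the authors' C++ implementation. A path is modelled as an immutable tuple of `(kind, v)` pairs so that it can serve as part of a key in a table of stored distances, and the bound `max_fusions` is an assumed cap on the number of successive fusions recorded for one node.

```python
from typing import Optional, Tuple

# A fusion is a pair (kind, v): kind is "e" (edge fusion) or "u" (node fusion),
# v is the index of the node involved. These type names are hypothetical.
Fusion = Tuple[str, int]
Path = Tuple[Fusion, ...]


def extend_path(path: Path, kind: str, v: int, max_fusions: int) -> Optional[Path]:
    """Return path.(kind, v), or None if the cap on successive fusions is reached."""
    if kind not in ("e", "u"):
        raise ValueError("kind must be 'e' (edge fusion) or 'u' (node fusion)")
    if len(path) >= max_fusions:
        return None  # no further fusion is recorded for this node
    return path + ((kind, v),)
```

Because tuples are hashable, a value such as `(("u", 3), ("e", 1))` can be used directly as a dictionary key alongside the two forests being compared.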

4.3 Implementation and Complexity

The previous section gave the recurrence formulae for

calculating the edit distance between two trees allowing for

node and edge fusion and split. We now discuss the

complexity of the algorithm. This requires paying attention

to some high-level implementation details that, in the case

of the tree edit distance problem, may have an important

influence on the theoretical complexity of the algorithm.

Such details were first observed by Zhang and Shasha. They

concern the order in which to perform the operations

indicated in (2) and (1) to obtain an algorithm that is time

and space efficient.

Let us consider the last line of (2). We may observe that the computation of the distance between two forests refers to the computation of the distance between two trees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$. We must therefore memorise the distance between any two subtrees of $T$ and $T'$. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees $U$ and $U'$, the distance between any subtrees of $U$ and $U'$ must already have been measured. This explains the space complexity, which is in $O(|T| \times |T'|)$ and corresponds to the size of the table used for storing such distances in memory.

If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at $i'$ and $j'$ if $i'$ is on the path from $l(i)$ to $i$ and $j'$ is on the path from $l(j)$ to $j$, for $i$ and $j$ nodes of, respectively, $T$ and $T'$.

We define a set $LR(T)$ of the left roots of $T$ as follows:

$$LR(T) = \{k \mid 1 \leq k \leq |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k)\}$$


Fig. 9. Zhang and Shasha's dynamic programming algorithm: the forest distance part.



The algorithm for computing the edit distance between $T$ and $T'$ then consists in computing the distance between each subtree rooted at a node in $LR(T)$ and each subtree rooted at a node in $LR(T')$. Such subtrees are considered from the leaves to the root of $T$ and $T'$, that is, in the order of their indexes.
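The classical computation described so far (recurrences (1) and (2), evaluated over the left roots in increasing index order) can be sketched in Python as follows. This is a minimal illustration of the Zhang and Shasha scheme only, without the fusion and split operations; it is not the authors' C++ code. The tree encoding as `(label, children)` tuples, the 0-based postfix indices, the function names, and the default unit costs are all assumptions made for the example.

```python
def postorder(tree):
    """Return (labels, leftmost) for a tree encoded as (label, [children]).

    labels[i] is the label of the i-th node in postfix order (0-based);
    leftmost[i] is the index of the leftmost leaf of the subtree rooted at i.
    """
    labels, leftmost = [], []

    def visit(node):
        label, children = node
        first_leaf = None
        for child in children:
            child_index = visit(child)
            if first_leaf is None:
                first_leaf = leftmost[child_index]
        labels.append(label)
        own_index = len(labels) - 1
        leftmost.append(own_index if first_leaf is None else first_leaf)
        return own_index

    visit(tree)
    return labels, leftmost


def left_roots(leftmost):
    """LR(T): nodes k such that no k' > k has the same leftmost leaf."""
    last = {}
    for k, leaf in enumerate(leftmost):
        last[leaf] = k  # later nodes overwrite earlier ones
    return sorted(last.values())


def tree_edit_distance(t1, t2,
                       cost_del=lambda a: 1,
                       cost_ins=lambda b: 1,
                       cost_match=lambda a, b: 0 if a == b else 1):
    a, la = postorder(t1)
    b, lb = postorder(t2)
    tdist = [[0] * len(b) for _ in a]  # distances between subtrees
    for i in left_roots(la):
        for j in left_roots(lb):
            i0, j0 = la[i], lb[j]
            m, n = i - i0 + 1, j - j0 + 1
            # fd[x][y]: distance between forests a[i0..i0+x-1] and b[j0..j0+y-1]
            fd = [[0] * (n + 1) for _ in range(m + 1)]
            for x in range(1, m + 1):
                fd[x][0] = fd[x - 1][0] + cost_del(a[i0 + x - 1])
            for y in range(1, n + 1):
                fd[0][y] = fd[0][y - 1] + cost_ins(b[j0 + y - 1])
            for x in range(1, m + 1):
                for y in range(1, n + 1):
                    p, q = i0 + x - 1, j0 + y - 1
                    delete = fd[x - 1][y] + cost_del(a[p])
                    insert = fd[x][y - 1] + cost_ins(b[q])
                    if la[p] == i0 and lb[q] == j0:
                        # both forests are trees: recurrence (1)
                        fd[x][y] = min(delete, insert,
                                       fd[x - 1][y - 1] + cost_match(a[p], b[q]))
                        tdist[p][q] = fd[x][y]
                    else:
                        # general forests: recurrence (2)
                        fd[x][y] = min(delete, insert,
                                       fd[la[p] - i0][lb[q] - j0] + tdist[p][q])
    return tdist[-1][-1]
```

As a usage example, with unit costs the two six-node trees f(d(a, c(b)), e) and f(c(d(a, b)), e) are at distance 2: node c is deleted from the first tree and then inserted above d.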

Zhang and Shasha proved that this algorithm has a time complexity in $O(|T| \times \min(leaf(T), height(T)) \times |T'| \times \min(leaf(T'), height(T')))$, $leaf(T)$ designating the number of leaves of $T$ and $height(T)$ its height. In the worst case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.

Taking fusion and split operations into account does not change the above reasoning. However, we must now store in memory the distance between all subtrees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$, and all the possible values of $path$ and $path'$.

We must therefore determine the number of values that $path$ can take. This amounts to determining the total number of successive fusions that could be applied to a given node. We recall that $path$ is a list of pairs $(e \text{ or } u, v)$. Let $path = \{(e \text{ or } u, v_1), (e \text{ or } u, v_2), \ldots, (e \text{ or } u, v_l)\}$ be the list for node $i$ of $T$. The first fusion can be performed only with a child $v_1$ of $i$. If $d$ is the maximum degree of $T$, there are $d$ possible choices for $v_1$. The second fusion can be done with one of the children of $i$ or with one of its grandchildren. Let $v_2$ be the node chosen. There are $d + d^2$ possible choices for $v_2$. Following the same reasoning, there are $\sum_{k=1}^{l} d^k$ possible choices for the $l$th node $v_l$ to be fused with $i$.

Furthermore, we must take into account the fact that a fusion can concern a node or an edge. The total number of values possible for the variable $path$ is therefore:

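$$2^l \prod_{k=1}^{l} \sum_{j=1}^{k} d^j \leq 2^l \prod_{k=1}^{l} \frac{d^{k+1} - 1}{d - 1},$$

that is:

$$2^l \left(\frac{1}{d - 1}\right)^{l} \prod_{k=1}^{l} \left(d^{k+1} - 1\right) < 2^l \left(\frac{1}{d - 1}\right)^{l} d^{\frac{(l+1)(l+2)}{2}}.$$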

A node $i$ may then be involved in $O((2d)^l)$ possible successive (node/edge) fusions.

As indicated, we must store in memory the distance between each subtree $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$ for all possible values of $path$ and $path'$. The space complexity of our algorithm is thus in $O((2d)^l \times (2d')^l \times |T| \times |T'|)$, with $d$ and $d'$ the maximum degrees of, respectively, $T$ and $T'$.
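To get a feeling for how quickly the number of $path$ values grows, the counting expression above and its closed-form upper bound can be evaluated numerically. The short Python sketch below does this; the function names are hypothetical, and the bound is only meaningful for a maximum degree $d \geq 2$ (it divides by $d - 1$).

```python
from math import prod


def path_value_count(d: int, l: int) -> int:
    """Evaluate 2^l * prod_{k=1..l} sum_{j=1..k} d^j for degree d and l fusions."""
    return 2 ** l * prod(sum(d ** j for j in range(1, k + 1))
                         for k in range(1, l + 1))


def path_value_bound(d: int, l: int) -> float:
    """Evaluate the upper bound 2^l * (1/(d-1))^l * d^((l+1)(l+2)/2); needs d >= 2."""
    exponent = (l + 1) * (l + 2) // 2  # (l+1)(l+2) is always even
    return 2 ** l * d ** exponent / (d - 1) ** l
```

For instance, with $d = 3$ and $l = 2$ the count is $4 \times 3 \times 12 = 144$, below the bound of $4 \times 3^6 / 4 = 729$.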

The computation of the time complexity of our algorithm is done in a similar way as for the algorithm of Zhang and Shasha. For each node of $T$ and $T'$, one must compute the number of subtree distance computations the node will be involved in by considering all subtrees rooted in, respectively, a node of $LR(T)$ and a node of $LR(T')$. In our case, one must also take into account for each node the possibility of applying a fusion. This leads to a time complexity in

$$O((2d)^l \times |T| \times \min(leaf(T), height(T)) \times (2d')^l \times |T'| \times \min(leaf(T'), height(T'))).$$

This complexity suggests that the fusion operations may be used only for reasonable trees (typically, fewer than 100 nodes) and small values of $l$ (typically, less than 4). It is however important to observe that the overall number of fusions one may perform can be much greater than $l$ without affecting the worst-case complexity of the algorithm. Indeed, any number of fusions can be made while still retaining the bound of

$$O((2d)^l \times |T| \times \min(leaf(T), height(T)) \times |T'| \times \min(leaf(T'), height(T')))$$

so long as one does not perform more than $l$ consecutive fusions for each node.

In general, also, most interesting tree representations of an RNA are of small enough size, as will be shown next, together with some initial results obtained in practice.

5 APPLICATION TO RNA SECONDARY STRUCTURES COMPARISON

The algorithm presented in the previous section has been coded using C++. An online version is available at http://www-igm.univ-mlv.fr/~allali/migal/.

We recall that RNAs are relatively small molecules with

sizes limited to a few kilobases. For instance, the small

ribosomal subunit of Sulfolobus acidocaldarius (D14876) is

made up of 1,147 bases. Using the representation shown in

Fig. 2b, the tree obtained contains 440 internal nodes and

567 leaves, that is 1,007 nodes overall. Using the representation in Fig. 2d, the tree is composed of 78 nodes. Finally, the tree obtained using the representation given in Fig. 2e contains only 48 nodes. We therefore see that even for large RNAs, any of the known abstract tree representations (that is, representations which take the elements of the secondary structure of an RNA into account) that we can use leads to a tree of manageable size for our algorithm. In fact, for small values of $l$ (2 or 3), the tree comparison takes reasonable time (a few minutes) and memory (less than 1 GB).

As we already mentioned, a fusion (respectively, split) can

be viewed as an alternative to a deletion (respectively,

insertion) followed by a relabeling. Therefore, the cost

function for a fusion must be chosen carefully.




To simplify, we reason on the cost of a node fusion without considering the label of the edges leading to the nodes that are fused with a father. The formal definition of the cost functions takes the edges also into account.

Let us assume that the cost function returns a real value between zero and one. If we want to compute the cost of a fusion between two nodes $u$ and $v$, the aim is to give to such fusion a cost slightly greater than the cost of deleting $v$ and relabeling $u$; that is, we wish to have $cost_{node\_fusion}(u, v) = \min(cost_{del}(v) + t, 1)$. The parameter $t$ is a tuning parameter for the fusion.

Suppose that the new node $w$ resulting from the fusion of $u$ and $v$ matches with another node $z$. The cost of this match is $cost_{match}(w, z)$. If we do not allow for node fusions, the algorithm will first match $u$ with $z$, then will delete $v$. If we compare the two possibilities, on one hand we have a total cost of $cost_{node\_fusion}(u, v) + cost_{match}(w, z)$ for the fusion, that is, $cost_{del}(v) + t + cost_{match}(w, z)$; on the other hand, a cost of $cost_{del}(v) + cost_{match}(u, z)$. Thus, $t$ represents the gain that must be obtained by $cost_{match}(w, z)$ with regard to $cost_{match}(u, z)$, that is, by a match without fusion. This is illustrated in Fig. 10.

In this example, the cost associated with the path on the top is $cost_{match}(5, 9) + cost_{del}(3)$. The path at the bottom has a cost of $cost_{node\_fusion}(5, 3) = cost_{del}(3) + t$ for the node fusion, to which is added a relabeling cost of $cost_{match}(8, 9)$, leading to a total of $cost_{match}(8, 9) + cost_{del}(3) + t$. A node fusion will therefore be chosen if $cost_{match}(8, 9) + t < cost_{match}(5, 9)$, that is, if the score of a match with fusion is better by at least $t$ than a match without fusion.

We apply the same reasoning to the cost of an edge fusion. The cost function for a node and an edge fusion between a node $u$ and a node $v$, with $e_u$ denoting the edge leading to $u$ and $e_v$ the edge leading to $v$, is defined as follows:

$$cost_{node\_fusion}(u, v) = cost_{del}(v) + cost_{del}(e_v) + t$$

$$cost_{edge\_fusion}(u, v) = cost_{del}(u) + cost_{del}(e_u) + t + \sum_{c \text{ sibling of } v} cost(\text{deleting subtree rooted at } c).$$
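The two cost definitions, and the comparison between a match with fusion and a match without fusion, can be written down directly. The Python sketch below is only illustrative: the function names and the flat numeric arguments are assumptions (the deletion costs are passed in as already computed values rather than derived from node and edge labels), and the numbers used in the usage note are made up.

```python
def cost_node_fusion(cost_del_v: float, cost_del_ev: float, t: float) -> float:
    """cost_del(v) + cost_del(e_v) + t, with t the tuning parameter."""
    return cost_del_v + cost_del_ev + t


def cost_edge_fusion(cost_del_u: float, cost_del_eu: float,
                     sibling_subtree_costs, t: float) -> float:
    """cost_del(u) + cost_del(e_u) + t + cost of deleting the subtrees
    rooted at the siblings of v (given as an iterable of costs)."""
    return cost_del_u + cost_del_eu + t + sum(sibling_subtree_costs)


def fusion_preferred(match_with_fusion: float, match_without_fusion: float,
                     t: float) -> bool:
    """A fusion pays off only if it improves the match cost by more than t."""
    return match_with_fusion + t < match_without_fusion
```

With $t = 0.05$, a match cost of 0.1 after fusion against 0.3 without fusion makes the fusion worthwhile, whereas 0.28 against 0.3 does not, since the improvement is smaller than $t$.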

The tuning parameter $t$ is thus an important parameter that allows us to control fusions. Always considering a cost function that produces real values between 0 and 1, if $t$ is equal to 0.1, a fusion will be performed only if it improves the score by 0.1. In practice, we use values of $t$ between 0 and 0.2.

For practical considerations, we also set a further condition on the cost and relabeling functions related to a node or edge resulting from a fusion, which is as follows:

$$cost_{del}(a) + cost_{del}(b) \geq cost_{del}(c)$$

with $c$ the label of the node/edge resulting from the fusion of the nodes/edges labeled $a$ and $b$. Indeed, if this condition is not fulfilled, the algorithm may systematically fuse the nodes or edges to reduce the overall cost.

An important consequence of the conditions seen above is that a node fusion cannot be followed by an edge fusion. Below, the node fusion followed by an edge fusion costs:

$$(cost_{del}(b) + cost_{del}(B) + t) + (cost_{del}(AB) + cost_{del}(a) + t).$$

The alternative is to destroy node $B$ (together with edge $b$) and then to operate an edge fusion, the whole costing: $cost_{del}(b) + cost_{del}(B) + cost_{del}(A) + cost_{del}(a) + t$. The difference between these two costs is $t + cost_{del}(AB) - cost_{del}(A)$, which is always positive.

This observation allows us to significantly improve the performance of the algorithm in practice.

We have applied the new algorithm to the two RNAs shown in Fig. 5 (these are eukaryotic nuclear P RNAs from Saccharomyces uvarum and Saccharomyces kluveri) and coded using the same type of representation as in Fig. 2d. We have limited the number of consecutive fusions to one ($l = 1$). The computation of the edit distance between the two trees, taking node and edge fusions into account besides deletions, insertions, and relabeling, required less than a second. The total cost allowing for fusions is 6.18 with $t = 0.05$, against 7.42 without fusions. As indicated in Fig. 11, the last two problems discussed in Section 3 disappear thanks to some edge fusions (represented by the boxes).

An example of node fusions required when comparing two real RNAs is given in Fig. 12. The RNAs are coded using the same type of representation as in Fig. 2d. The figure shows part of the mapping obtained between the small subunits of two ribosomal RNAs retrieved from [8] (from Bacillaria paxillifer and Calicophoron calicophorum). The node fusion has been circled.


Fig. 10. Illustration of the gain that must be obtained using a fusion

instead of a deletion/relabeling.



6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA

We briefly discuss now an approach which addresses in

part the scattering effect problem (see Section 2). This

approach is being currently validated and will be more fully

described in another paper. We therefore present here the

main idea only.

To start with, it is important to understand the nature of

this scattering effect. Let us consider first a trivial case: the

cost functions are unitary (insertion, deletion, and relabeling

each cost 1) and we compute the edit distance between two

trees composed of a single node each. The obtained mapping will associate the single node in the first tree with the single one in the second tree, independently from the labels of the nodes. This example can be extended to the comparison of two trees whose node labels are all different. In this case, the obtained mapping corresponds to the maximum homeomorphic subtree common to both trees.

If the two RNA secondary structures compared using a tree representation which models both the base pairs and the nonpaired bases are globally similar but present some local dissimilarity, then an edit operation will almost always associate the nodes of the locally divergent regions that are located at the same positions relative to the global common structure. This is a normal, expected behavior in the context of an edit. However, it seems clear also when we look at Fig. 4 that the bases of a terminal loop should not be mapped to those of a multiple loop.

To reduce this problem, one possible solution consists of

adding to the nodes corresponding to a base information concerning the element of secondary structure to which the base belongs. The cost functions are then adapted to take this type of information into account. This solution, although producing interesting results, is not entirely satisfying. Indeed, the algorithm will tend to systematically put into correspondence nodes (and, thus, bases) belonging to structural elements of the same type, which is also not necessarily a good choice as these elements may not be related in the overall structure. It seems therefore preferable to have a structural approach first, mapping initially the elements of secondary structure to each other and taking care of the nucleotides in a second step only.

The approach we have elaborated may be briefly described as follows: Given two RNA secondary structures, the first step consists in coding the RNAs by trees of type c in Fig. 2 (nodes represent bulges or multiple, internal, or terminal loops while edges code for helices). We then compute the edit distance between these two trees using the two novel fusion operations described in this paper. This also produces a mapping between the two trees. Each node and edge of the trees, that is, each element of secondary structure, is then colored according to this mapping. Two elements are thus of the same color if they have been mapped in the first step. We now have at our disposal information concerning the structural similarity of the two RNAs. We can then code the RNAs using a tree of type b. To these trees, we add to each node the color of the structural element to which it belongs. We need now only to restrict the match operation to nodes of the same color. Two nodes can therefore match only if they belong to secondary elements that have been identified in the first step as being similar.

Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.

Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.

To illustrate the use of this algorithm, we have applied it to the two RNAs of Fig. 4. Fig. 13 presents the trees of the type in Fig. 2c coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.

This approach allows us to obtain interesting results. Furthermore, it considerably reduces the complexity of the algorithm for comparing two RNA structures coded with trees of the type in Fig. 2b. However, it is important to observe that the scattering effect problem is not specific to the tree representations of the type in Fig. 2b. Indeed, the same problem may be observed, to a lesser degree, with trees of the type in Fig. 2c. This is the reason why we generalize the process by adopting a modelling of RNA secondary structures at different levels of abstraction. This model, and the accompanying algorithm for comparing RNA structures, are in progress.
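The second step of the approach sketched in this section, restricting the match operation to nodes of the same color, can be expressed as a wrapper around an ordinary match cost. The Python fragment below is a hypothetical illustration rather than the authors' implementation: it assumes that each node is annotated as a `(label, color)` pair and that a forbidden match is modelled by an infinite cost, so that a minimising edit distance computation falls back on deletion and insertion.

```python
import math


def colored_match_cost(base_cost):
    """Wrap a label-based match cost so that only same-color nodes may match.

    Nodes are assumed to be (label, color) pairs; base_cost compares labels.
    """
    def cost(node_a, node_b):
        (label_a, color_a), (label_b, color_b) = node_a, node_b
        if color_a != color_b:
            return math.inf  # structural elements were not mapped in step one
        return base_cost(label_a, label_b)
    return cost
```

A unit relabeling cost wrapped in this way returns 0 for `("A", 1)` against `("A", 1)` but an infinite cost for `("A", 1)` against `("A", 2)`, which prevents the two bases from being associated.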

7 FURTHER WORK AND CONCLUSION

We have proposed an algorithm that addresses two main limitations of the classical tree edit operations for comparing RNA secondary structures. Its complexity is high in theory if many fusions are applied in succession to any given (the same) node, but the total number of fusions that may be performed is not limited. In practice, the algorithm is fast enough for most situations one can meet.

To provide a more complete solution to the problem of the scattering effect, we also proposed a new multilevel approach for comparing two RNA secondary structures whose main idea was sketched in this paper. Further details and evaluation of this novel comparison scheme will be the subject of another paper.

REFERENCES

[1] D. Bouthinon and H. Soldano, "A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences," Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.

[2] J.W. Brown, "The Ribonuclease P Database," Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.

[3] N. el Mabrouk and F. Lisacek, "Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome," J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.

[4] I. Hofacker, "The Vienna RNA Secondary Structure Server," 2003.

[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, "Fast Folding and Comparison of RNA Secondary Structures," Monatshefte für Chemie, vol. 125, pp. 167-188, 1994.

[6] M. Hochsmann, T. Toller, R. Giegerich, and S. Kurtz, "Local Similarity in RNA Secondary Structures," Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.

[7] M. Hochsmann, B. Voss, and R. Giegerich, "Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.

[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, "The European Database on Small Subunit Ribosomal RNA," Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.

[9] T. Jiang, L. Wang, and K. Zhang, "Alignment of Trees - An Alternative to Tree Edit," Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.

[10] F. Lisacek, Y. Diaz, and F. Michel, "Automatic Identification of Group I Intron Cores in Genomic DNA Sequences," J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.


Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting

from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for

hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.



[11] B. Shapiro, "An Algorithm for Multiple RNA Secondary Structures," Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.

[12] B.A. Shapiro and K. Zhang, "Comparing Multiple RNA Secondary Structures Using Tree Comparisons," Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.

[13] K.-C. Tai, "The Tree-to-Tree Correction Problem," J. ACM, vol. 26, no. 3, pp. 422-433, 1979.

[14] K. Zhang and D. Shasha, "Simple Fast Algorithms for the Editing Distance between Trees and Related Problems," SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.

[15] M. Zuker, "Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction," Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.

Julien Allali studied at the University of Marne-la-Vallée (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne-la-Vallée. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.

Marie-France Sagot received the BSc degree in computer science from the University of São Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallée, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.





Topological Rearrangements and Local Search Method for Tandem Duplication Trees

Denis Bertrand and Olivier Gascuel

Abstract: The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch

[4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing

numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication

trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR,

TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree

Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these

restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is

applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all

existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to

tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any

other program.

Index Terms: Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc

finger genes.

1 INTRODUCTION

REPEATED sequences constitute an important fraction of most genomes, from the well-studied Escherichia coli bacterial genome [1] to the Human genome [2]. For

example, it is estimated that more than 50 percent of the

Human genome consists of repeated sequences [2], [3].

There exist three major types of repeated sequences:

transposon-derived repeats, micro or minisatellites, and

large duplicated sequences, the last often containing one or several RNA or protein-coding genes. Micro or minisatellites arise through a mechanism called slipped-strand

mispairing, and are always arranged in tandem: copies of

a same basic unit are linearly ordered on the chromosome.

Large duplicated sequences are also often found in tandem

and, when this is the case, unequal recombination is widely

assumed to be responsible for their formation.

Both the linear order among tandemly repeated se-

quences, and the knowledge of the biological mechanisms

responsible for their generation, suggest a simple model of

evolution by duplication. This model, first described by

Fitch in 1977 [4], introduces tandem duplication trees as

phylogenies constrained by the unequal recombination

mechanism. Although being a completely different biologi-

cal mechanism, slipped-strand mispairing leads to the same

duplication model [5]. A formal recursive definition of this

model is provided in Section 2, but its main features can be

grasped from the examples of Fig. 1. Fig. 1a shows the

duplication history of the 13 Antennapedia-class homeobox

genes from the cognate group [6]. In this history, the

ancestral locus has undergone a series of simple duplica-

tion events where one of the genes has been duplicated into

two adjacent copies. Starting from the unique ancestral

gene, this series of events has produced the extant locus

containing the 13 linearly ordered contemporary genes. It is

easily seen [7] that trees only containing simple duplication

events are equivalent to binary search trees with labeled

leaves. They differ from standard phylogenies in that node

children have left/right orientation. Fig. 1b shows another

example corresponding to the nine variable genes of the

human T cell receptor Gamma (TRGV) locus [8]. In this

history, the most recent event involves a double duplica-

tion where two adjacent genes have been simultaneously

duplicated to produce four adjacent copies. Duplication

trees containing multiple duplication events differ from

binary search trees, but are less general than phylogenies.

The model proposed by Fitch [4] covers both simple and

multiple duplication trees.

Fitch's paper [4] received relatively little attention at the

time of its publication probably due to the lack of available

sequence data. Rediscovered by Benson and Dong [9],

Tang et al. [10], and Elemento et al. [8], tandemly repeated

sequences and their suggested duplication model have

recently received much interest, providing several new

computational biology problems and challenges [11], [12].

The main challenge consists of creating algorithms

incorporating the model constraints to reconstruct the


The authors are with Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS-Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5, France. E-mail: [email protected].

Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.

For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0170-1004.

1545-5963/05/$20.00 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.



duplication history of tandemly repeated sequences.

Indeed, accurate reconstruction of duplication histories

will be useful to elucidate various aspects of genome

evolution. They will provide new insights into the

mechanisms and determinants of gene and protein domain

duplication, often recognized as major generators of

novelty [13]. Several important gene families, such as

immunity-related genes, are arranged in tandem; better

understanding their evolution should provide new insights

into their duplication dynamics and clues about their

functional specialization. Studying the evolution of micro

and minisatellites could resolve unanswered biological questions regarding human migrations or the evolution of

bacterial diseases [14].

Given a set of aligned and ordered sequences (DNA or

proteins), the aim is to find the duplication tree that best

explains these sequences, according to usual criteria in

phylogenetics, e.g., parsimony or minimum evolution. Few

studies have focused on the computational hardness of this

problem, and all of these studies only deal with the

restricted version where simultaneous duplication of multi-

ple adjacent segments is not allowed. In this context, Jaitly

et al. [15] shows that finding the optimal single copy

duplication tree with parsimony is NP-Hard and that this

problem has a PTAS (Polynomial Time Approximation

Scheme). Another closely related PTAS is given by Tang

et al. [10] for the same problem. On the other hand,

Elemento et al. [7] describes a polynomial distance-based

algorithm that reconstructs optimal single copy tandem

duplication trees with minimum evolution.

However, it is commonly believed, as in phylogeny, that

most (especially multiple) duplication tree inference pro-

blems are NP-Hard. This explains the development of

heuristic approaches. Benson and Dong [9] provides various

parsimony-basedheuristic reconstruction algorithms to infer

duplication trees, especially from minisatellites. Elemento

et al. [8] present an enumerative algorithm that computes the

most parsimonious duplication tree; this algorithm (by its

exhaustive approach) is limited to datasets of less than 15

repeats. Several distance-based methods have also been

described.The WINDOW method [10] usesan agglomeration

scheme similar to UPGMA [16] and NJ [17], but the cost

function used to judge potential duplication is based on the

assumptionthat thesequencesfollow a molecular clockmode

of evolution. The DTSCORE method [18] uses the same

schemebut corrects this limitation using a score criterion[19],

like ADDTREE [20]. DTSCORE can be used with sequences

that do not follow the molecular clock, which is, for example,

essential when dealing with gene families containing

pseudogenes that evolve much faster than functional genes.

Finally, GREEDY SEARCH [21] corresponds to a different

approach divided into two steps: First, a phylogeny is

computed with a classical reconstruction method (NJ), then,

with nearest neighbor interchange (NNI) rearrangements, a

duplication tree close to this phylogeny is computed. This

approach is noteworthy since it implements topological

rearrangements which are highly useful in phylogenetics

[22], but it works blindly and does not ensure that good duplication trees will be found (cf. Section 5.2).

Topological rearrangements have an essential function in

phylogenetic inference, where they are used to improve an

initial phylogeny by subtree movement or exchange.

Rearrangements are very useful for all common criteria

(parsimony, distance, maximum likelihood) and are inte-

grated into all classical programs like PAUP* [23] or

PHYLIP [24]. Furthermore, they are used to define various

distances between phylogenies and are the foundation of

much mathematical work [25]. Unfortunately, they cannot

be directly used here, as shown by a simple example given


Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].

(b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In

both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.



later. Indeed, when applied to a duplication tree, they do

not guarantee that another valid duplication tree will be

produced.

In this paper, we describe a set of topological rearrange-

ments to stay inside the duplication tree space and explore

the whole space from any of its elements. We then show the

advantages of this approach for duplication tree inference

from sequences. In Section 2, we describe the duplication

model introduced by [4], [8], [10], as well as an algorithm to

recognize duplication trees in linear time. Thanks to this

algorithm, we restrict the neighborhoods defined by

classical phylogeny rearrangements, namely, nearest neigh-

bor interchange (NNI) and subtree pruning and regrafting

(SPR), to valid duplication trees. We demonstrate (Section 3)

that for NNI moves this restricted neighborhood does not

allow the exploration of the whole duplication tree space.

On the other hand, we demonstrate that the restricted

neighborhood of SPR rearrangement allows the whole

space to be explored. In this way, we define a local search

method, applied here to parsimony and minimum evolu-

tion (Section 4). We compare this method to other existing

approaches using simulated and real data sets (Section 5).

We conclude by discussing the positive results obtained by

our method, and indicate directions for further research

(Section 6).

2 MODEL

2.1 Duplication History and Duplication Tree

The tandem duplication model used in this article was first

introduced by Fitch [4] then studied independently by [8],

[10]. It is based on unequal recombination which is assumed

to be the sole evolution mechanism (except point mutations)

acting on sequences. Although it is a completely different

biological mechanism, slipped-strand mispairing leads to

the same duplication model [5], [9].

Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences

representing the extant locus. Initially containing a single

copy, the locus grew through a series of consecutive

duplications. As shown in Fig. 2a, a duplication history

may contain simple duplication events. When the dupli-

cated fragment contains two, three, or k repeats, we say that

it involves a multiple duplication event. Under this

duplication model, a duplication history is a rooted tree

with n labeled and ordered leaves, in which internal nodes

of degree 3 correspond to duplication events. In a real

duplication history (Fig. 2a), the time intervals between

consecutive duplications are completely known, and the

internal nodes are ordered from top to bottom according to

the moment they occurred in the course of evolution. Any ordered segment set of the same height then represents an

ancestral state of the locus. We call such a set a floor, and

we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a

floor where i and j are consecutive and i is on the left of j.

However, in the absence of a molecular clock mode of

evolution (a typical problem), it is impossible to recover the

order between the duplication events of two different

lineages from the sequences. In this case, we are only able to

infer a duplication tree (DT) (Fig. 2b) or a rooted

duplication tree (RDT) (Fig. 2c).

A duplication tree is an unrooted phylogeny with

ordered leaves, whose topology is compatible with at least

one duplication history. Also, internal nodes of duplication

trees are partitioned into events (or blocks following

[10]), each containing one or more (ordered) nodes. We

distinguish simple duplication events that contain a

unique internal node (e.g., b and f in Fig. 2c) and multiple

duplication events which group a series of adjacent and

simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let

$E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal

nodes $s_i, s_{i+1}, \ldots, s_k$ in left-to-right order. We say that two

consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$)

just like in histories, as any event belongs to a floor in all of

BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 17

Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the

possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position 1 on (b).



the histories that are compatible with the DT being

considered. The same notation will also be used for leaves

to express the segment order in the extant locus. When the

tree is rooted, every internal node $s_j$ is unambiguously

associated with one parent and two child nodes; moreover,

one child of $s_j$ is left and the other one is right; these are

denoted $l_j$ and $r_j$, respectively. In this case, for any duplication history that is compatible with this tree, the child

nodes of an event $(s_i, s_{i+1}, \ldots, s_k)$ are organized as follows:

$$l_i \prec l_{i+1} \prec \cdots \prec l_k \prec r_i \prec r_{i+1} \prec \cdots \prec r_k.$$
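As a small illustration of this ordering (a hypothetical event, not one taken from the figures), consider a double duplication event $E = (s_1, s_2)$ that copies two adjacent segments at once. Writing $\prec$ for the adjacency relation and instantiating the formula with $i = 1$ and $k = 2$ gives:

```latex
% Hypothetical double duplication event E = (s_1, s_2)
\[
  E = (s_1, s_2): \qquad l_1 \prec l_2 \prec r_1 \prec r_2
\]
```

The two left children come first, followed by the two right children, so the siblings $l_j$ and $r_j$ are not adjacent to each other. For a simple event ($i = k$) the formula reduces to $l_i \prec r_i$.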

In [8], [26], [27], it was shown that rooting a

duplication tree is different from rooting a phylogeny:

the root of a duplication tree necessarily lies on the tree

path between the most distant repeats on the locus, i.e., 1

and n; moreover, the root is always located above all

multiple duplications. For example, Fig. 1b shows that there are

only three valid root positions; the root cannot be a direct

ancestor of 12.

2.2 Recursive Definition of Rooted and Unrooted Duplication Trees

A duplication tree is compatible with at least one duplica-

tion history. This suggests a recursive definition, which

progressively reconstructs a possible history, given a

phylogeny T and a leaf ordering O. We define a cherry

$(l, s, r)$ as a pair of leaves ($l$ and $r$) separated by a single

node $s$ in $T$, and we call $C(T)$ the set of cherries of $T$. This

recursive definition reverses evolution: It searches for a

visible duplication event, agglomerates this event, and

checks whether the reduced tree is a duplication tree. In the case of rooted trees, we have:

$(T, O)$ defines a duplication tree with root $\rho$ if and only if:

1. $(T, O)$ only contains $\rho$, or

2. there is in $C(T)$ a series of cherries $(l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), \ldots, (l_k, s_k, r_k)$ with $k \ge i$ and $l_i \prec l_{i+1} \prec \cdots \prec l_k \prec r_i \prec r_{i+1} \prec \cdots \prec r_k$ in $O$, such that $(T', O')$ defines a duplication tree with root $\rho$, where $T'$ is obtained from $T$ by removing $l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k$, and $O'$ is obtained by replacing $(l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k)$ by $(s_i, s_{i+1}, \ldots, s_k)$ in $O$.

The definition for unrooted trees is quite similar:

$(T, O)$ defines an unrooted duplication tree if and only if:

1. $(T, O)$ contains 1 segment, or

2. same as for rooted trees, with $(T', O')$ now defining an unrooted duplication tree.

Those definitions provide a recursive algorithm, RADT

(Recognition Algorithm for Duplication Trees), to check

whether any given phylogeny with ordered leaves is a

duplication tree. In case of success, this algorithm can also

be used to reconstruct duplication events: At each step, the

series of internal nodes denoted above as $(s_i, s_{i+1}, \ldots, s_k)$ is

a duplication event. When the tree is rooted, $l_j$ is the left

child of $s_j$ and $r_j$ its right child, for every $j$, $i \le j \le k$. This

algorithm can be implemented in $O(n)$ time [26], where $n$ is the

number of leaves. Another linear algorithm is proposed by

Zhang et al. [21] using a top down approach instead of a

bottom-up one, but applies only to rooted duplication trees.
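The rooted definition above can be followed almost literally in code. The sketch below is a naive illustration and not the linear-time implementation of [26]: it tries every visible event and recurses, so it may take exponential time in the worst case. The tree encoding (a `parent` dictionary mapping each non-root node to its parent), the node labels, and the function name `is_rooted_duplication_tree` are assumptions made for this example; the input is assumed to be a valid rooted binary tree whose leaves are exactly the entries of `order`.

```python
from typing import Dict, List


def is_rooted_duplication_tree(parent: Dict[str, str], order: List[str]) -> bool:
    """Check the recursive definition of a rooted duplication tree.

    parent: maps every non-root node to its parent (hypothetical encoding).
    order:  current left-to-right ordering O of the not-yet-agglomerated nodes.
    """
    n = len(order)
    if n == 1:
        return True  # case 1: only the root remains
    # Case 2: look for a visible event l_i..l_k r_i..r_k occupying
    # 2*m consecutive positions of the ordering, starting at `start`.
    for start in range(n):
        for m in range(1, (n - start) // 2 + 1):
            lefts = order[start:start + m]
            rights = order[start + m:start + 2 * m]
            parents = [parent.get(l) for l in lefts]
            if any(p is None for p in parents):
                continue  # the root cannot be the child of an event node
            # Each (l_j, r_j) must be a cherry, i.e. share the same parent s_j.
            if all(parent.get(r) == p for r, p in zip(rights, parents)):
                # Agglomerate: replace the 2*m children by their m parents.
                reduced = order[:start] + parents + order[start + 2 * m:]
                if is_rooted_duplication_tree(parent, reduced):
                    return True
    return False


# Hypothetical example trees on the ordered leaves "1", "2", ...
# Caterpillar ((1,2),3): two simple duplications.
caterpillar = {"1": "a", "2": "a", "a": "root", "3": "root"}
# One double duplication: cherries (1,3) and (2,4) form a single event.
double_dup = {"1": "s1", "3": "s1", "2": "s2", "4": "s2",
              "s1": "root", "s2": "root"}
# Cherries (1,4) and (2,3): rejected by the recursive definition.
not_dt = {"1": "a", "4": "a", "2": "b", "3": "b", "a": "root", "b": "root"}
```

Trying every visible event keeps the sketch faithful to the "there is a series of cherries" wording of the definition, at the cost of efficiency.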

3 TOPOLOGICAL REARRANGEMENTS FOR DUPLICATION TREES

This section shows how to explore the DT space using SPR

rearrangements. First, we describe some NNI, SPR, and

TBR rearrangement properties with standard phylogenies.

However, these rearrangements cannot be directly used to

explore the DT space. Indeed, when applied to a duplica-

tion tree, they do not guarantee that another valid

duplication tree will be produced. So, we have decided to

restrict the neighborhood defined by those rearrangements

to duplication trees. If we only used NNI rearrangements,

the neighborhood would be too restricted (as shown by a

simple example) and would not allow the whole DT space

to be explored. On the other hand, we can distinguish two

types of SPR rearrangements which, when applied to a

rooted duplication tree, guarantee that another valid

duplication tree will be produced. Thanks to these specific

rearrangements, we demonstrate that restricting the neigh-

borhood of SPR rearrangements allows the whole space of

duplication trees to be explored.


Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: $T$, of which RT is a rooted version; $T'$ is obtained by

applying NNI(5,4) around the bold edge; none of the possible root positions of $T'$ (a, b, c, and d) leads to a valid RDT, cf. tree (b), which

corresponds to root b in $T'$.



3.1 Topological Rearrangements for Phylogeny

There are many ways of carrying out topological rearrange-

ments on phylogeny [22]. We only describe NNI (Nearest

Neighbor Interchange), SPR (Subtree Pruning Regrafting),

and TBR (Tree Bisection and Reconnection) rearrangements.

The NNI move is a simple rearrangement which

exchanges two subtrees adjacent to the same internal edge (Figs. 3 and 4). There are two possible NNIs for each

internal edge, and hence $2(n-3)$ neighboring trees for one tree

with n leaves. This rearrangement allows the whole space of

phylogeny to be explored; i.e., there is a succession of NNI

moves making it possible to transform any phylogeny $P_1$ into any phylogeny $P_2$ [28].
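The count of NNI moves can be checked on a small example. The sketch below enumerates the two NNI moves around each internal edge of an unrooted binary tree stored as an adjacency dictionary; the encoding, the node labels, and the helper names `build_adjacency` and `nni_neighbors` are assumptions made for this illustration. It enumerates moves and does not test whether the resulting trees are duplication trees.

```python
from typing import Dict, List, Set, Tuple

Adjacency = Dict[str, Set[str]]


def build_adjacency(edges: List[Tuple[str, str]]) -> Adjacency:
    """Build an undirected adjacency dictionary from a list of edges."""
    adj: Adjacency = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def nni_neighbors(adj: Adjacency) -> List[Adjacency]:
    """Return the trees obtained by one NNI move (two per internal edge)."""
    result: List[Adjacency] = []
    seen_edges = set()
    for u in adj:
        if len(adj[u]) != 3:
            continue  # u is a leaf
        for v in adj[u]:
            if len(adj[v]) != 3 or (v, u) in seen_edges:
                continue  # (u, v) is not internal, or was already handled
            seen_edges.add((u, v))
            # Subtrees hanging off u (other than v); b is the one that moves.
            _, b = sorted(adj[u] - {v})
            # Swapping b with each subtree hanging off v gives the two NNIs.
            for c in sorted(adj[v] - {u}):
                new = {x: set(nb) for x, nb in adj.items()}
                new[u].remove(b)
                new[b].remove(u)
                new[v].remove(c)
                new[c].remove(v)
                new[u].add(c)
                new[c].add(u)
                new[v].add(b)
                new[b].add(v)
                result.append(new)
    return result


# Hypothetical 5-leaf unrooted tree with internal nodes x, y, z.
tree = build_adjacency([("1", "x"), ("2", "x"), ("x", "y"), ("3", "y"),
                        ("y", "z"), ("4", "z"), ("5", "z")])
neighbors = nni_neighbors(tree)
```

With $n = 5$ leaves the tree has $n - 3 = 2$ internal edges, so the list holds $2(n-3) = 4$ rearranged trees.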

The SPR move consists of pruning a subtree and

regrafting it, by its root, to an edge of the resulting tree

(Figs. 6 and 7). We note that the neighborhood of a tree

defined by the NNI rearrangements is included in the

neighborhood defined by SPRs. The latter rearrangement

defines a neighborhood of size $2(n-3)(2n-7)$ [25]. Finally, TBR generalizes SPR by allowing the pruned

subtree to be reconnected by any of its edges to the resulting

tree. These three rearrangements (NNI, SPR, and TBR) are

reversible; that is, if $T'$ is obtained from $T$ by a particular

rearrangement, then $T$ can be obtained from $T'$ using the

same type of rearrangement.

3.2 NNI Rearrangements Do Not Stay in DT Space

The classical phylogenetic rearrangements (NNI, SPR,

TBR,...) do not always stay in DT space. So, if we apply

an NNI to a DT (e.g., Fig. 3), the resulting tree is not always

a valid DT. This property is also true for SPR and TBR rearrangements since NNI rearrangements are included in

these two rearrangement classes.

3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored

To restrict the neighborhood defined by NNI rearrangements

to duplication trees, each element of the neighborhood

is filtered using the recognition algorithm (RADT).

However, this restricted neighborhood does not allow the whole

DT space to be explored. Fig. 4 gives an example of a

duplication tree, T, the neighborhood of which does not

contain any DT. So, its restricted neighborhood is empty,

and there is no succession of restricted NNIs allowing T to

be transformed into any other DT.

3.4 Restricted SPR Allows the Whole DT Space to Be Explored

As before, we restrict (using RADT) the neighborhood

defined by SPR rearrangements to duplication trees. We

call restricted SPRs those SPR moves that, starting from a

duplication tree, lead to another duplication tree.

Main Theorem. Let T1 and T2 be any given duplication trees; T1

can be transformed into T2 via a succession of restricted SPRs.

Proof. To demonstrate the Main Theorem, we define two

types of special SPR that ensure staying within the space

of rooted duplication trees (RDT). Given these two types

of SPRs, we demonstrate that it is possible to transform

any rooted duplication tree into a caterpillar, i.e., a

rooted tree in which all internal nodes belong to the tree

path between leaf 1 and the tree root (cf. Fig. 5).

This result demonstrates the theorem. Indeed, let T1 and T2 be two RDTs. We can transform T1 and T2 into a

caterpillar by a succession of restricted SPRs. So, it is

possible to transform T1 into T2 by a succession of

restricted SPRs, with (possibly) a caterpillar as inter-

mediate tree. This property holds since the reciprocal

movement of an SPR is an SPR. As the two SPR types

proposed ensure that we stay within the RDTs space, we

have the desired result for rooted duplication trees. And,

this result extends to unrooted duplication trees since

two DTs can be arbitrarily rooted, transformed from one

to the other using restricted SPRs, then unrooted. □

The first special SPR allows multiple duplication

events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a

duplication event, with $r_i$ and $l_k$ respectively the right child of $s_i$


Fig. 5. A six-leaf caterpillar.

Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: $T$, of which RT is a rooted version; $T'$ is obtained by

exchanging subtrees 1 and (2 5); none of the possible root positions of $T'$ (a, b, and c) leads to a valid duplication tree, cf. tree (b), which corresponds

to root b in $T'$; and the same holds for every neighbor of $T$ obtained by NNI.



and the left child of $s_k$, and let $p_i$ be the father of $s_i$. The

DELETE rearrangement consists of pruning the subtree of

root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while

$l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6

illustrates this rearrangement.
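The DELETE move can be sketched on a parent-map encoding of a rooted tree. This is an illustration under stated assumptions, not the authors' implementation: the `parent`, `left`, and `right` dictionaries, the node labels, and the function name `delete_move` are all hypothetical. One labeling choice differs from the text: instead of renaming $l_i$ to $s_i$, the sketch keeps the label of $l_i$ and reuses the freed label $s_i$ for the node created on the edge $(s_k, l_k)$; the resulting topology is the same.

```python
from typing import Dict, List


def delete_move(parent: Dict[str, str], event: List[str],
                left: Dict[str, str], right: Dict[str, str]) -> Dict[str, str]:
    """Apply DELETE to a multiple event (s_i, ..., s_k) of a rooted tree.

    parent: maps every non-root node to its parent.
    event:  the event nodes s_i, ..., s_k in left-to-right order.
    left, right: left and right child of each event node.
    Returns a new parent map; the input is left unchanged.
    """
    if len(event) < 2:
        raise ValueError("this sketch handles multiple events only")
    s_i, s_k = event[0], event[-1]
    l_i, r_i, l_k = left[s_i], right[s_i], left[s_k]
    new = dict(parent)
    new[l_i] = parent[s_i]  # l_i takes the place of s_i below p_i
    new[s_i] = s_k          # label s_i is reused for the node on edge (s_k, l_k)
    new[l_k] = s_i          # l_k now hangs below that new node
    new[r_i] = s_i          # the pruned subtree r_i is regrafted there
    return new


# Hypothetical example: one double duplication on ordered leaves 1, 2, 3, 4.
parent = {"1": "s1", "3": "s1", "2": "s2", "4": "s2",
          "s1": "root", "s2": "root"}
left = {"s1": "1", "s2": "2"}
right = {"s1": "3", "s2": "4"}
after_delete = delete_move(parent, ["s1", "s2"], left, right)
```

In this example the double event is replaced by simple duplications only: leaves 2 and 3 now form a cherry, which joins leaf 4, which in turn joins leaf 1 at the root.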

Lemma 1. DELETE preserves the RDT property.

Proof. Let $T$ be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$

be an event of $T$, and $T'$ be the tree obtained from $T$ by

applying DELETE to $E$ (Fig. 6b). The children of any node $s_j$ ($i \le j \le k$) are denoted $l_j$ and $r_j$.

By definition, for any duplication history compatible

with T we have

$$l_i \prec l_{i+1} \prec \cdots \prec l_k \prec r_i \prec r_{i+1} \prec \cdots \prec r_k.$$

Thus, there is a way to partially agglomerate T (using an

RADT-like procedure) such that these nodes become

leaves. The same agglomeration can be applied to $T'$, as

only ancestors of the $l_j$s and $r_j$s are affected by DELETE.

Now, 1) agglomerate the event $E$ of $T$, and 2) reduce $T'$

by agglomerating the cherry $(l_k, r_i)$ and then agglomerating the
