IEEE Transactions on Computational Biology and Bioinformatics (January-March). Volume 2, Number 1 (2005)

    Guest Editorial: WABI Special Section Part II

    Junhyong Kim and Inge Jonassen

    The Fourth International Workshop on Algorithms in Bioinformatics (WABI) 2004 was held in Bergen, Norway, September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the proceedings from the workshop (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman).

    The WABI 2004 program committee selected a small number of papers among the 39 to be invited to submit extended versions of their papers to a special section of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. Four papers were published in the October-December 2004 issue of the journal and this issue contains an additional three papers. We would like to thank both the entire program committee for WABI and the reviewers of the papers in this issue for their valuable contributions.

    The first of the papers is "A New Distance for High Level RNA Secondary Structure Comparison" authored by Julien Allali and Marie-France Sagot. This paper describes algorithms for comparing secondary structures of RNA molecules where the structures are represented by trees. The problem of classifying RNA secondary structure is becoming critical as biologists are discovering more and more noncoding functional elements in the genome (e.g., miRNA). Most likely, the major functional determinants of the elements are their secondary structure and, therefore, a metric between such secondary structures will also help delineate clusters of functional groups. In Allali and Sagot's paper, two tree representations of secondary structure are compared by analysing how one tree can be transformed into the other using an allowed set of operations. Each operation can be associated with a cost and the distance between two trees can then be defined as the minimum cost associated with a transform of one tree to the other. Allali and Sagot introduce two new operations that they name "edge fusion" and "node fusion" and show that these alleviate limitations associated with the classical tree edit operations used for RNA comparison. Importantly, they also present algorithms for calculating the distance between trees allowing the new operations in addition to the classical ones, and analyze the performance of the algorithms.

    The second paper is "Topological Rearrangements and Local Search Method for Tandem Duplication Trees" and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Tandem repeats arise through evolution by copying (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference with the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree form. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.

    The third paper is "Optimizing Multiple Seeds for Homology Search" authored by Daniel G. Brown. The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets and heuristic methods for restricting the alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.

    Junhyong Kim

    Inge Jonassen

    Guest Editors

    IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 1

    . J. Kim is with the Department of Biology, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104. E-mail: [email protected].

    . I. Jonassen is with the Department of Informatics and Computational Biology Unit, University of Bergen, HIB N5020 Bergen, Norway. E-mail: [email protected].

    For information on obtaining reprints of this article, please send e-mail to:

    [email protected]/05/$20.00 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM


    Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the CyberInfrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.

    Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.


    A New Distance for High Level RNA Secondary Structure Comparison

    Julien Allali and Marie-France Sagot

    Abstract: We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this term depends only on the number of successive fusions that may be applied to a same node, not on the total number of fusions. The algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.

    Index Terms: Tree comparison, edit operation, distance, RNA, secondary structure.

    1 INTRODUCTION

    RNAs are one of the fundamental elements of a cell. Their role in regulation has been recently shown to be far more prominent than initially believed (see the 20 December 2002 issue of Science, which designated small RNAs with regulatory function as the scientific breakthrough of the year). It is now known, for instance, that there is massive transcription of noncoding RNAs. Yet current mathematical and computer tools remain mostly inadequate to identify, analyze, and compare RNAs.

    An RNA may be seen as a string over the alphabet of nucleotides (also called bases), {A, C, G, U}. Inside a cell, RNAs do not retain a linear form, but instead fold in space. The fold is given by the set of nucleotide bases that pair. The main type of pairing, called canonical, corresponds to bonds of the type A-U and G-C. Other rarer types of bonds may be observed, the most frequent among them being G-U, also called the wobble pair. Fig. 1 shows the sequence of a folded RNA. Each box represents a consecutive sequence of bonded pairs, corresponding to a helix in 3D space. The secondary structure of an RNA is the set of helices (or the list of paired bases) making up the RNA. Pseudoknots, which may be described as a pair of interleaved helices, are in general excluded from the secondary structure of an RNA. RNA secondary structures can thus be represented as planar graphs. An RNA primary structure is its sequence of nucleotides while its tertiary structure corresponds to the geometric form the RNA adopts in space.

    Apart from helices, the other main structural elements in

    an RNA are:

    1. hairpin loops, which are sequences of unpaired bases closing a helix;

    2. internal loops, which are sequences of unpaired bases linking two different helices;

    3. bulges, which are internal loops with unpaired bases on one side only of a helix;

    4. multiloops, which are unpaired bases linking at least three helices.

    Stems are successions of one or more among helices, internal loops, and/or bulges.

    The comparison of RNA secondary structures is one of

    the main basic computational problems raised by the study

    of RNAs. It is the problem we address in this paper. The

    motivations are many. RNA structure comparison has been

    used in at least one approach to RNA structure prediction

    that takes as initial data a set of unaligned sequences

    supposed to have a common structural core [1]. For each

    sequence, a set of structural predictions are made (for

    instance, all suboptimal structures predicted by an algorithm

    like Zuker's MFOLD [15], or all suboptimal sets of

    compatible helices or stems). The common structure is then

    found by comparing all the structures obtained from the

    initial set of sequences, and identifying a substructure

    common to all, or to some of the sequences. RNA structure

    comparison is also an essential element in the discovery of RNA structural motifs, or profiles, or of more general

    models that may then be used to search for other RNAs of

    the same type in newly sequenced genomes. For instance,

    general models for tRNAs and introns of group I have been

    derived by hand [3], [10]. It is an open question whether

    models at least as accurate as these, or perhaps even more

    accurate, could have been derived in an automatic way. The

    identification of smaller structural motifs is an equally

    important topic that requires comparing structures.

    As we saw, the comparison of RNA structures may concern known RNA structures (that is, structures that were experimentally determined) or predicted structures. The objective in both cases is the same: to find the common parts of such structures.

    . J. Allali is with the Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454, Marne-la-Vallée Cedex 2, France. E-mail: [email protected].

    . M.-F. Sagot is with Inria Rhône-Alpes, Université Claude Bernard, Lyon I, 43 Bd du Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: [email protected].

    Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.

    For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0164-1004.

    1545-5963/05/$20.00 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

    In [11], Shapiro suggested to mathematically model RNA

    secondary structures without pseudoknots by means of

    trees. The trees are rooted and ordered, which means that

    the order among the children of a node matters. This order

    corresponds to the 5'-3' orientation of an RNA sequence.

    Given two trees representing each an RNA, there are two

    main ways for comparing them. One is based on the

    computation of the edit distance between the two trees

    while the other consists in aligning the trees and using the

    score of the alignment as a measure of the distance between

    the trees. Contrary to what happens with sequences, the

    two, alignment and edit distance, are not equivalent. The

    alignment distance is a restrained form of the edit distance

    between two trees, where all insertions must be performed

    before any deletions. The alignment distance for general

    trees was defined in 1994 by Jiang et al. in [9] and extended

    to an alignment distance between forests in [6]. More recently, Höchsmann et al. [7] applied the tree alignment

    distance to the comparison of two RNA secondary

    structures. Because of the restriction on the way edit

    operations can be applied in an alignment, we are not

    concerned in this paper with tree alignment distance and

    we therefore address exclusively from now on the problem

    of tree edit distance.

    Our way for comparing two RNA secondary structures is then to apply a number of tree edit operations in one or both of the trees representing the RNAs until isomorphic trees are obtained. The currently most popular program using this approach is probably the Vienna package [5], [4]. The tree edit operations considered are derived from the operations classically applied to sequences [13]: substitution, deletion, and insertion. In 1989, Zhang and Shasha [14] gave a dynamic programming algorithm for comparing two trees. Shapiro and Zhang then showed [12] how to use tree editing to compare RNAs. The latter also proposed various tree models that could be used for representing RNA secondary structures. Each suggested tree offers a more or less detailed view of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few examples of such possible views for the RNA given in Fig. 2a.

    In Fig. 2, the nodes of the tree in Fig. 2b represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with, respectively, a base or a pair of bases. A node of the tree in Fig. 2c represents a set of successive unpaired bases or of stacked paired ones. The label of a node is an integer indicating, respectively, the number of unpaired bases or the height of the stack of paired ones. The nodes of the tree in Fig. 2d represent elements of secondary structure: hairpin loop (H), bulge (B), internal loop (I), or multiloop (M). The edges correspond to helices. Finally, the tree in Fig. 2e contains only the information concerning the skeleton of multiloops of an RNA. The last representation, though giving a highly simplified view of an RNA, is important nevertheless as it is generally accepted that it is this skeleton which is usually the most constrained part of an RNA. The last two models may be enriched with information concerning, for instance, the number of (unpaired) bases in a loop (hairpin, internal, multi) or bulge, and the number of paired bases in a helix. The first labels the nodes of the tree, the second its edges. Other types of information may be added (such as overall composition of the elements of secondary structure). In fact, one could consider working with various representations simultaneously or in an interlocked, multilevel fashion. This goes beyond the scope of this paper which is concerned with comparing RNA secondary structures using any one among the many tree representations possible. We shall, however, comment further on this multilevel approach later on.
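    As an illustration of the most detailed representation (Fig. 2b), the following minimal sketch builds a tree whose internal nodes are base pairs and whose leaves are unpaired bases. It assumes the secondary structure is supplied in dot-bracket notation, which the text does not use; the names `Node`, `base_level_tree`, and `to_string`, and the artificial `root` node, are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                  # a base ("A") or a base pair ("G-C")
    children: List["Node"] = field(default_factory=list)


def base_level_tree(seq: str, structure: str) -> Node:
    """Build a Fig. 2b-style tree from a sequence and a dot-bracket string.

    Assumption: '(' and ')' mark paired bases, '.' marks an unpaired base,
    and there are no pseudoknots. Children are kept in 5'-3' order.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    root = Node("root")                         # artificial root, not a base
    stack = [(root, -1)]                        # (open node, index of its 5' base)
    for i, symbol in enumerate(structure):
        if symbol == ".":
            stack[-1][0].children.append(Node(seq[i]))
        elif symbol == "(":
            node = Node(seq[i])                 # label completed at the closing base
            stack[-1][0].children.append(node)
            stack.append((node, i))
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            node, j = stack.pop()
            node.label = f"{seq[j]}-{seq[i]}"
        else:
            raise ValueError(f"unexpected symbol {symbol!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced structure")
    return root


def to_string(node: Node) -> str:
    """Render a tree as label(child child ...), children in order."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(to_string(c) for c in node.children) + ")"


print(to_string(base_level_tree("GGAAACC", "((...))")))
```

    The coarser representations of Figs. 2c to 2e could be derived from such a tree by collapsing runs of stacked pairs and of unpaired bases, which this sketch does not attempt.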

    Concerning the objectives of this paper, they are twofold.

    The first is to give some indications on why the classical edit

    operations that have been considered so far in the literature

    for comparing trees present some limitations when the trees

    stand for RNA structures. Three cases of such limitations will

    be illustrated through examples in Section 3. In Section 4, we

    then introduce two novel operations, so-called node-fusion

    and edge-fusion, that enable us to address some of these

    limitations and then give a dynamic programming algorithm

    for comparing two RNA structures with these two additional operations. Implementation issues and initial results are presented in Section 4. In Section 5, we give a first application of our algorithm to the comparison of two RNA secondary structures. Finally, in Section 6, we sketch the main ideas behind the multilevel RNA comparison approach mentioned above. Before that, we start by introducing some notation and by recalling in the next section the basics about classical tree edit operations and tree mapping.

    Fig. 1. Primary and secondary structures of a transfer RNA.

    Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of the same RNA (a).

    This paper is an extended version of a paper presented at

    the Workshop on Algorithms in BioInformatics (WABI) in

    2004, in Bergen, Norway. A few more examples are given to

    illustrate some of the points made in the WABI paper,

    complexity and implementation issues are discussed in

    more depth as are the cost functions and a multilevel

    approach to comparing RNAs.

    2 TREE EDITING AND MAPPING

    Let $T$ be an ordered rooted tree, that is, a tree where the order among the children of a node matters. We define three kinds of operations on $T$: deletion, insertion, and relabeling (corresponding to a substitution in sequence comparison). The operations are shown in Fig. 3. The deletion (Fig. 3b) of a node $u$ removes $u$ from the tree. The children of $u$ become the children of $u$'s father. An insertion (Fig. 3c) is the symmetric of a deletion. Given a node $u$, we remove a consecutive (in relation to the order among the children) set $u_1, \ldots, u_p$ of its children, create a new node $v$, make $v$ a child of $u$ by attaching it at the place where the set was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order) the children of $v$. The relabeling of a node (Fig. 3d) consists simply in changing its label.
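    The three operations can be sketched on a simple ordered-tree structure as follows. This is only an illustration under stated assumptions: the class `Node` and the functions `delete`, `insert`, `relabel`, and `show` are hypothetical names, nodes are addressed through their father and a child index, and deleting the root is not handled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def delete(father: Node, index: int) -> None:
    """Delete u = father.children[index]; the children of u take its place, in order."""
    u = father.children[index]
    father.children[index:index + 1] = u.children


def insert(u: Node, start: int, stop: int, label: str) -> Node:
    """Insert a new node v under u, adopting the consecutive children u.children[start:stop]."""
    v = Node(label, u.children[start:stop])
    u.children[start:stop] = [v]                # v sits where the adopted children were
    return v


def relabel(u: Node, label: str) -> None:
    """Change the label of u."""
    u.label = label


def show(node: Node) -> str:
    """Render a tree as label(child child ...), children in order."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(show(c) for c in node.children) + ")"


root = Node("A", [Node("B", [Node("C"), Node("D")]), Node("E")])
delete(root, 0)                                 # remove B; C and D move up
print(show(root))
insert(root, 0, 2, "B")                         # the symmetric operation restores B
print(show(root))
```

    Applying `insert` with the same position and label right after `delete` restores the original tree, which is the sense in which insertion is the symmetric of deletion.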

    Given two trees $T$ and $T'$, we define $S = \{s_1, \ldots, s_e\}$ to be a series of edit operations such that, if we apply successively the operations in $S$ to the tree $T$, we obtain $T'$ (i.e., $T$ and $T'$ become isomorphic). A series of operations like $S$ realizes the editing of $T$ into $T'$ and is denoted by $T \xrightarrow{S} T'$.

    We define a function $cost$ from the set of possible edit operations (deletion, insertion, relabeling) to the integers (or the reals) such that $cost(s)$ is the score of the edit operation $s$. If $S$ is a series of edit operations, we define by extension that $cost(S)$ is $\sum_{s \in S} cost(s)$. We can define the edit distance between two trees as the minimal cost of a series of operations that performs the editing of $T$ into $T'$:

    $$distance(T, T') = \min\{cost(S) \mid T \xrightarrow{S} T'\}.$$

    Let an insertion or a deletion cost one and the relabeling of a node cost zero if the label is the same and one otherwise. For the two trees of the figure on the left, the series $relabel(A \to F) \cdot delete(B) \cdot insert(G)$ realizes the editing of the left tree into the right one and costs 3. Another possibility is the series $delete(B) \cdot relabel(A \to G) \cdot insert(F)$, which also costs 3. The distance between these two trees is 3.
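    For small trees, the edit distance under these unit costs can be computed with a plain memoized recursion over forests, shown below as a sketch. It is not the optimized algorithm of Zhang and Shasha [14]; the tuple encoding of trees and the names `tree`, `forest_distance`, and `tree_distance` are assumptions made for this example, and the trees used are hypothetical.

```python
from functools import lru_cache


def tree(label, *children):
    """Encode an ordered tree as a hashable pair (label, tuple of subtrees)."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f and not g:
        return 0
    if not g:                                   # only deletions remain
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:                                   # only insertions remain
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    (lv, kv), (lw, kw) = f[-1], g[-1]           # rightmost roots v and w
    return min(
        # delete v: its children are promoted into the forest
        forest_distance(f[:-1] + kv, g) + 1,
        # insert w: symmetric case on the second forest
        forest_distance(f, g[:-1] + kw) + 1,
        # map v to w: relabel if needed, then solve children and the rest
        forest_distance(kv, kw)
        + forest_distance(f[:-1], g[:-1])
        + (0 if lv == lw else 1),
    )


def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))


t1 = tree("A", tree("B", tree("C"), tree("D")), tree("E"))
t2 = tree("A", tree("C"), tree("D"), tree("E"))
print(tree_distance(t1, t2))                    # one deletion (of B) is enough
```

    Each branch of the `min` corresponds to one classical operation applied to the rightmost root, so the value returned is the cost of a cheapest series of deletions, insertions, and relabelings.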

    Given a series of operations $S$, let us consider the nodes of $T$ that are not deleted (in the initial tree or after some relabeling). Such nodes are associated with nodes of $T'$. The mapping $M_S$ relative to $S$ is the set of couples $(u, u')$ with $u \in T$ and $u' \in T'$ such that $u$ is associated with $u'$ by $S$.

    The operations described above are the classical tree edit operations that have been commonly used in the literature for RNA secondary structure comparison. We now present a few results obtained using such classical operations that will allow us to illustrate a few limitations they may present when used for comparing RNA structures.

    3 LIMITATIONS OF CLASSICAL TREE EDIT OPERATIONS FOR RNA COMPARISON

    As suggested in [12], the tree edit operations recalled in the previous section can be used on any type of tree coding of an RNA secondary structure.

    Fig. 4 shows two RNAsePs extracted from the database [2] (they are found, respectively, in Streptococcus gordonii and Thermotoga maritima). For the example we discuss now, we code the RNAs using the tree representation indicated in Fig. 2b where a node represents a base pair and a leaf an unpaired base. After applying a few edit operations to the trees, we obtain the result indicated in Fig. 4, with deleted/inserted bases in gray. We have surrounded a few regions that match in the two trees. Bases in the rectangular box at the bottom of the RNA on the left are thus associated with bases in the bottom rightmost rectangular box of the RNA on the right. The same is observed for the bases in the oval boxes for both RNAs. Such matches illustrate one of the main problems with the classical tree edit operations: Bases in one RNA may be mapped to identically labeled bases in the other RNA to minimize the total cost, while such bases should not be associated in terms of the elements of secondary structure to which they belong. In fact, such elements are often distant from one another along the common RNA structure. We call this problem the "scattering effect." It is related to the definition of tree edit operations. In the case of this example and of the representation adopted, the problem might have been avoided if structural information had been used. Indeed, the problem appears also because the structural location of an unpaired base is not taken into account. It is therefore possible to match, for instance, an unpaired base from a hairpin loop with an unpaired base from a multiloop. Using another type of representation, as we shall do, would, however, not be enough to solve all problems as we see next.

    Fig. 3. Edit operations: (a) the original tree T, (b) deletion of the node labeled D, (c) insertion of the node labeled I, and (d) relabeling of a node in T (the label A of the root is changed into K).

    Indeed, to compare the same two RNAs, we can also use a more abstract tree representation such as the one given in Fig. 2d. In this case, the internal nodes represent a multiloop, internal-loop, or bulge, the leaves code for hairpin loops and edges for helices. The result of the editing of $T$ into $T'$ for some cost function is presented in Fig. 5 (we shall come back later to the cost functions used in the case of such more abstract RNA representations; for the sake of this example, we may assume an arbitrary one is used).

    The problem we wish to illustrate in this case is shown by the boxes in the figure. Consider the boxes at the bottom. In the left RNA, we have a helix made up of 13 base pairs. In the right RNA, the helix is formed by seven base pairs followed by an internal loop and another helix of size 5. By definition (see Section 2), the algorithm can only associate one element in the first tree to one element in the second tree. In this case, we would like to associate the helix of the left tree to the two helices of the second tree since it seems clear that the internal loop represents either an inserted element in the second RNA, or the unbonding of one base pair. This, however, is not possible with classical edit operations.

    A third type of problem one can meet when using only the three classical edit operations to compare trees standing for RNAs is similar to the previous one, but concerns this time a node instead of edges in the same tree representation. Often, an RNA may present a very small helix between two elements (multiloop, internal-loop, bulge, or hairpin-loop) while such helix is absent in the other RNA. In this case, we would therefore have liked to be able to associate one node in a tree representing an RNA with two or more nodes in the tree for the other RNA. Once again, this is not possible with any of the classical tree edit operations. An illustration of this problem is shown in Fig. 6.

    Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNAsePs, of Saccharomyces uvarum and of Saccharomyces kluveri, using the model given in Fig. 2d.

    Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNAsePs, of Streptococcus gordonii and of Thermotoga maritima, using the model given in Fig. 2b.

    We shall use RNA representations that take the elements of the structure of an RNA into account to avoid some of the scattering effect. Furthermore, in addition to considering information of a structural nature, labels are attached, in general, to both nodes and edges of the tree representing an RNA. Such labels are numerical values (integers or reals). They represent in most cases the size of the corresponding element, but may also further indicate its composition, etc. Such additional information is then incorporated into the cost functions for all three edit operations. It is important to observe that when dealing with trees labeled at both the nodes and edges, any node and the edge that leads to it (or, in an alternative perspective, departs from it) represent a single object from the point of view of computing an edit distance between the trees.

    It remains now to deal with the last two problems that are a consequence of the one-to-one associations between nodes and edges enforced by the classical tree edit operations. To that purpose, we introduce two novel tree edit operations, called the edge fusion and the node fusion.

    4 INTRODUCING NOVEL TREE EDIT OPERATIONS

    4.1 Edge Fusion and Node Fusion

    In order to address some of the limitations of the classical tree edit operations that were illustrated in the previous section, we need to introduce two novel operations. These are the edge fusion and the node fusion. They may be applied to any of the tree representations given in Figs. 2c, 2d, and 2e.

    An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$ and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$ and $e_{c_i}$. For instance, if such labels indicated the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max or the average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$ and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.

    An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$ and of the edge between them. Observe that a node fusion may be simulated using the classical edit operations by a deletion followed by a relabeling. However, the difference between a node fusion and a deletion/relabeling is in the cost associated with both operations. We shall come back to this point later.

    Like insertions or deletions, edge fusions and

    node fusions have of course symmetric counterparts, which

    are the edge split and the node split.Given two rooted, ordered, and labeled trees T and T0,

    we define the edit distance with fusion between T and T0

    ALLALI AND SAGOT: A NEW DISTANCE FOR HIGH LEVEL RNA SECONDARY STRUCTURE COMPARISON 7

    Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.

    Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNAsePs from Pyrococcus furiosus and

    Metallosphaera sedula. Triangles stand for bulges, diamond stand for internal loops, and squares for hairpin loops.


    as $\mathrm{distance}_{fusion}(T, T') = \min\{\mathrm{cost}_S \mid T \xrightarrow{S} T'\}$, with $\mathrm{cost}_s$ the cost associated to each of the seven edit operations now considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).

    Proposition 1. If the following is verified:

    . $\mathrm{cost}_{match}(a, b)$ is a distance,
    . $\mathrm{cost}_{ins}(a) = \mathrm{cost}_{del}(a) \ge 0$,
    . $\mathrm{cost}_{\text{node fusion}}(a, b, c) = \mathrm{cost}_{\text{node split}}(a, b, c) \ge 0$, and
    . $\mathrm{cost}_{\text{edge fusion}}(a, b, c) = \mathrm{cost}_{\text{edge split}}(a, b, c) \ge 0$,

    then $\mathrm{distance}_{fusion}$ is indeed a distance.

    Proof. The positiveness of $\mathrm{distance}_{fusion}$ is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that $\mathrm{distance}_{fusion}$ satisfies the triangular inequality. $\square$

    Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.

    We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.
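The two fusion operations of Section 4.1 can be pictured on a small in-memory tree. The sketch below is only illustrative: the `Node` class, its field names, and the choice of summing sizes to relabel are assumptions made here (the text mentions the sum only as one possible relabeling function), and the cost bookkeeping is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: `label` is the size of a loop and
    `edge` is the size of the helix (edge) leading to this node."""
    label: float
    edge: float = 0.0
    children: List["Node"] = field(default_factory=list)


def node_fusion(u: Node, i: int) -> Node:
    """Fuse the i-th child of u into u.

    u becomes the father of all children of that child and is relabeled
    with a function (assumed here to be the sum) of the labels of u,
    of the child, and of the edge between them."""
    c = u.children[i]
    u.label = u.label + c.edge + c.label
    u.children[i:i + 1] = c.children  # c's children take c's place, in order
    return u


def edge_fusion(u: Node, i: int) -> Tuple[Node, List[Node]]:
    """Fuse the edge leading to u with the edge leading to its i-th child.

    The child is returned hanging from a single new edge whose label is
    (by assumption) the sum of the sizes of e_u, u and e_c. The subtrees
    rooted at the other children of u are deleted; they are returned so
    that a caller could add their deletion cost."""
    c = u.children[i]
    deleted = u.children[:i] + u.children[i + 1:]
    c.edge = u.edge + u.label + c.edge
    return c, deleted
```

In this sketch a node fusion keeps `u` and absorbs the child, whereas an edge fusion drops `u` and keeps the child, which is why only the edge fusion has to report deleted sibling subtrees.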

    4.2 Algorithm

    The method we introduce is a dynamic programming algorithm based on the one proposed by Zhang and Shasha. Their algorithm is divided in two parts: They first compute the edit distance between two trees (this part is denoted by TDist) and then the distance between two forests (this part is denoted by FDist). Fig. 8 illustrates in pictorial form the TDist part and Fig. 9 the FDist part of the computation.

    In order to take our two new operations into account, we need to compute a few more things in the TDist part. Indeed, we must add the possibility for each tree to have a node fusion (inversely, node split) between the root and one of its children, or to have an edge fusion (inversely, edge split) between the root and one of its children. These additional operations are indicated in the right box of Fig. 8.

    We now present a formal description of the algorithm. Let T be an ordered rooted tree with $|T|$ nodes. We denote by $t_i$ the ith node in a postfix order. For each node $t_i$, $l(i)$ is the index of the leftmost leaf of the subtree rooted at $t_i$. Let $T[i \ldots j]$ denote the forest composed by the nodes $t_i \ldots t_j$ ($T = T[0 \ldots |T|]$). To simplify notation, from now on, when there is no ambiguity, i will refer to the node $t_i$. In this case, $\mathrm{distance}(i_1 \ldots i_2, j_1 \ldots j_2)$ will be equivalent to $\mathrm{distance}(T[i_1 \ldots i_2], T'[j_1 \ldots j_2])$.

    The algorithm of Zhang and Shasha is fully described by the following recurrence formula:

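    If $i_1 = l(i_2)$ and $j_1 = l(j_2)$:

    \[
    \mathrm{distance}(i_1 \ldots i_2,\ j_1 \ldots j_2) = \min
    \begin{cases}
    \mathrm{distance}(i_1 \ldots i_2 - 1,\ j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
    \mathrm{distance}(i_1 \ldots i_2,\ j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
    \mathrm{distance}(i_1 \ldots i_2 - 1,\ j_1 \ldots j_2 - 1) + \mathrm{cost}_{match}(i_2, j_2)
    \end{cases}
    \tag{1}
    \]

    else:

    \[
    \mathrm{distance}(i_1 \ldots i_2,\ j_1 \ldots j_2) = \min
    \begin{cases}
    \mathrm{distance}(i_1 \ldots i_2 - 1,\ j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
    \mathrm{distance}(i_1 \ldots i_2,\ j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
    \mathrm{distance}(i_1 \ldots l(i_2) - 1,\ j_1 \ldots l(j_2) - 1) + \mathrm{distance}(l(i_2) \ldots i_2,\ l(j_2) \ldots j_2)
    \end{cases}
    \tag{2}
    \]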

    Part (1) of the formula corresponds to Fig. 8, while part (2) corresponds to Fig. 9. In practice, the algorithm stores in a matrix the score between each subtree of T and T'. The space complexity is therefore $O(|T| \times |T'|)$. To reach this complexity, the computation must be done in a certain order (see Section 4.3). The time complexity of the algorithm is

    \[
    O\bigl(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr),
    \]

    where $\mathrm{leaf}(T)$ and $\mathrm{height}(T)$ represent, respectively, the number of leaves and the height of a tree T.
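To get a feeling for the size of this bound, the quantity inside the O(.) can be evaluated on concrete trees. The following is a minimal sketch under stated assumptions: a tree is encoded as a pair `(label, children)`, the function names are hypothetical, and the height is counted in nodes (a single node has height 1), a convention the text does not specify.

```python
def size(tree) -> int:
    """Number of nodes of a tree given as (label, children)."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def leaf_count(tree) -> int:
    """Number of leaves of the tree."""
    _, children = tree
    return 1 if not children else sum(leaf_count(c) for c in children)


def height(tree) -> int:
    """Height counted in nodes (assumption: a single node has height 1)."""
    _, children = tree
    return 1 + max((height(c) for c in children), default=0)


def zhang_shasha_time_bound(t1, t2) -> int:
    """|T| * min(leaf(T), height(T)) * |T'| * min(leaf(T'), height(T'))."""
    return (size(t1) * min(leaf_count(t1), height(t1))
            * size(t2) * min(leaf_count(t2), height(t2)))
```

For two chains the `min` terms equal 1 (a chain has a single leaf), so the bound reduces to the product of the sizes; bushier trees pay an extra factor bounded by their number of leaves or their height, whichever is smaller.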


    Fig. 8. Zhang and Shasha's dynamic programming algorithm: the tree distance part. The right box corresponds to the operations added to take fusion into account.


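    The formula to compute the edit score allowing for both node and edge fusions follows.

    If $i_1 \ge l(i_k)$ and $j_1 \ge l(j_{k'})$:

    \[
    \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'}\}, path') = \min
    \begin{cases}
    \mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + \mathrm{cost}_{del}(i_k) \\
    \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\
    \mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{match}(i_k, j_{k'}) \\[4pt]
    \text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\
    \quad \mathrm{distance}(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, path.(u, i_c), \{j_1 \ldots j_{k'}\}, path') \\
    \qquad + \mathrm{cost}_{\text{node fusion}}(i_c, i_k) \quad (\text{obs.: } i_k \text{ data are changed}) \\
    \quad \mathrm{distance}(\{i_l \ldots i_{c-1}, i_k\}, path.(e, i_c), \{j_1 \ldots j_{k'}\}, path') \\
    \qquad + \mathrm{cost}_{\text{edge fusion}}(i_c, i_k) + \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
    \qquad + \mathrm{distance}(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \quad (\text{obs.: } i_k \text{ data are changed}) \\[4pt]
    \text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\
    \quad \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, path'.(u, j_{c'})) \\
    \qquad + \mathrm{cost}_{\text{node split}}(j_{c'}, j_{k'}) \quad (\text{obs.: } j_{k'} \text{ data are changed}) \\
    \quad \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_{l'} \ldots j_{c'}, j_{k'}\}, path'.(e, j_{c'})) \\
    \qquad + \mathrm{cost}_{\text{edge split}}(j_{c'}, j_{k'}) + \mathrm{distance}(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
    \qquad + \mathrm{distance}(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \quad (\text{obs.: } j_{k'} \text{ data are changed})
    \end{cases}
    \tag{3}
    \]

    else, set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$:

    \[
    \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'}\}, path') = \min
    \begin{cases}
    \mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + \mathrm{cost}_{del}(i_k) \\
    \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\
    \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + \mathrm{distance}(\{i_l \ldots i_k\}, path, \{j_{l'} \ldots j_{k'}\}, path')
    \end{cases}
    \tag{4}
    \]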

    Given two nodes u and v such that v is a child of u, $\text{node fusion}(u, v)$ is the fusion of node v with u, and $\text{edge fusion}(u, v)$ is the edge fusion between the edges leading to, respectively, nodes u and v. The symmetric operations are denoted by, respectively, $\text{node split}(u, v)$ and $\text{edge split}(u, v)$.

    The distance computation takes two new parameters path and path'. These are sets of pairs (e or u, v) which indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions that were done. Thus, a pair (e, v) indicates that an edge fusion has been performed between $i_k$ and v, while for (u, v) a node v has been merged with node $i_k$.

    The notation path.(e, v) indicates that the operation (e, v) has been performed in relation to node $i_k$ and the information is thus concatenated to the set path of pairs currently linked with $i_k$.
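One simple way to picture the `path` parameter is as an immutable sequence of pairs. The sketch below is an assumption about representation only, not the authors' data structure: the names `Fusion`, `Path` and `extend_path` are hypothetical, nodes are identified by integers, and `max_len` plays the role of the bound on consecutive fusions per node discussed in Section 4.3.

```python
from typing import Optional, Tuple

# ('e', v) records an edge fusion with node v, ('u', v) a node fusion with v.
Fusion = Tuple[str, int]
# A tuple is hashable, so a path could serve as part of a table key.
Path = Tuple[Fusion, ...]


def extend_path(path: Path, kind: str, v: int, max_len: int) -> Optional[Path]:
    """Return path.(kind, v), or None if the path already holds max_len fusions."""
    if kind not in ("e", "u"):
        raise ValueError("kind must be 'e' (edge fusion) or 'u' (node fusion)")
    if len(path) >= max_len:
        return None
    return path + ((kind, v),)
```

Starting from the empty path `()`, each accepted fusion appends one pair, and a `None` result signals that no further consecutive fusion is allowed for that node.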

    4.3 Implementation and Complexity

    The previous section gave the recurrence formulae for calculating the edit distance between two trees allowing for node and edge fusion and split. We now discuss the complexity of the algorithm. This requires paying attention to some high-level implementation details that, in the case of the tree edit distance problem, may have an important influence on the theoretical complexity of the algorithm. Such details were first observed by Zhang and Shasha. They concern the order in which to perform the operations indicated in (2) and (1) to obtain an algorithm that is time and space efficient.

    Let us consider the last line of (2). We may observe that the computation of the distance between two forests refers to the computation of the distance between two trees $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$. We must therefore memorise the distance between any two subtrees of T and T'. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees U and U', the distance between any subtrees of U and U' must already have been measured. This explains the space complexity which is in $O(|T| \times |T'|)$ and corresponds to the size of the table used for storing such distances in memory.

    If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at i' and j' if i' is on the path from $l(i)$ to i and j' is on the path from $l(j)$ to j, for i and j nodes of, respectively, T and T'.

    We define a set $LR(T)$ of the left roots of T as follows:

    \[
    LR(T) = \{k \mid 1 \le k \le |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k)\}
    \]


    Fig. 9. Zhang and Shasha's dynamic programming algorithm: the forest distance part.


    The algorithm for computing the edit distance between T and T' consists then in computing the distance between each subtree rooted at a node in $LR(T)$ and each subtree rooted at a node in $LR(T')$. Such subtrees are considered from the leaves to the root of T and T', that is, in the order of their indexes.
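The classical part of the computation (recurrences (1) and (2), the table of subtree distances, and the left-root ordering) can be put together as follows. This is a minimal sketch of the standard Zhang and Shasha scheme without the fusion and split operations; the tree encoding `(label, children)`, the function names, and the unit costs used in the example are assumptions made for illustration, not the authors' C++ code.

```python
from typing import List, Tuple


def postorder(tree) -> Tuple[List, List[int]]:
    """Index nodes 1..|T| in postfix order.

    Returns (labels, l) where l[i] is the index of the leftmost leaf of
    the subtree rooted at node i. Index 0 is an unused placeholder."""
    labels, l = [None], [0]

    def visit(node) -> int:
        label, children = node
        leftmost = None
        for child in children:
            child_leftmost = visit(child)
            if leftmost is None:
                leftmost = child_leftmost
        labels.append(label)
        index = len(labels) - 1
        l.append(index if leftmost is None else leftmost)
        return l[index]

    visit(tree)
    return labels, l


def left_roots(l: List[int]) -> List[int]:
    """LR(T): nodes k such that no k' > k has l[k'] == l[k]."""
    last = {}
    for k in range(1, len(l)):
        last[l[k]] = k  # a later k overwrites an earlier one
    return sorted(last.values())


def tree_edit_distance(t1, t2, cost_del, cost_ins, cost_match) -> float:
    a, la = postorder(t1)
    b, lb = postorder(t2)
    # td[i][j]: distance between the subtrees rooted at i (in T) and j (in T')
    td = [[0.0] * len(b) for _ in a]
    for i in left_roots(la):
        for j in left_roots(lb):
            li, lj = la[i], lb[j]
            rows, cols = i - li + 2, j - lj + 2
            # fd[x][y]: distance between the forests made of the first x
            # nodes of T[li..i] and the first y nodes of T'[lj..j]
            fd = [[0.0] * cols for _ in range(rows)]
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + cost_del(a[li + x - 1])
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + cost_ins(b[lj + y - 1])
            for x in range(1, rows):
                for y in range(1, cols):
                    i2, j2 = li + x - 1, lj + y - 1
                    delete = fd[x - 1][y] + cost_del(a[i2])
                    insert = fd[x][y - 1] + cost_ins(b[j2])
                    if la[i2] == li and lb[j2] == lj:
                        # Case (1): both forests are trees.
                        match = fd[x - 1][y - 1] + cost_match(a[i2], b[j2])
                        fd[x][y] = min(delete, insert, match)
                        td[i2][j2] = fd[x][y]
                    else:
                        # Case (2): reuse the stored subtree distance.
                        x0, y0 = la[i2] - li, lb[j2] - lj
                        fd[x][y] = min(delete, insert, fd[x0][y0] + td[i2][j2])
    return td[len(a) - 1][len(b) - 1]
```

Because the left roots are visited in increasing index order, every subtree distance read in case (2) has been stored by an earlier iteration, which is the ordering constraint described above.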

    Zhang and Shasha proved that this algorithm has a time complexity in $O(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$, $\mathrm{leaf}(T)$ designating the number of leaves of T and $\mathrm{height}(T)$ its height. In the worst case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.

    Taking fusion and split operations into account does not change the above reasoning. However, we must now store in memory the distance between all subtrees $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$, and all the possible values of path and path'.

    We must therefore determine the number of values that path can take. This amounts to determining the total number of successive fusions that could be applied to a given node. We recall that path is a list of pairs (e or u, v). Let path = {(e or u, $v_1$), (e or u, $v_2$), ..., (e or u, $v_l$)} be the list for node i of T. The first fusion can be performed only with a child $v_1$ of i. If d is the maximum degree of T, there are d possible choices for $v_1$. The second fusion can be done with one of the children of i or with one of its grandchildren. Let $v_2$ be the node chosen. There are $d + d^2$ possible choices for $v_2$. Following the same reasoning, there are $\sum_{k=1}^{l} d^k$ possible choices for the lth node $v_l$ to be fusioned with i. Furthermore, we must take into account the fact that a fusion can concern a node or an edge. The total number of values possible for the variable path is therefore:

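    \[
    2^l \prod_{k=1}^{l} \sum_{j=1}^{k} d^j \le 2^l \prod_{k=1}^{l} \frac{d^{k+1} - 1}{d - 1},
    \]

    that is:

    \[
    2^l \left(\frac{1}{d - 1}\right)^{l} \prod_{k=1}^{l} \left(d^{k+1} - 1\right) < 2^l \left(\frac{1}{d - 1}\right)^{l} d^{\frac{(l+1)(l+2)}{2}}.
    \]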

    A node i may then be involved in $O((2d)^l)$ possible successive (node/edge) fusions.

    As indicated, we must store in memory the distance between each subtree $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$ for all possible values of path and path'. The space complexity of our algorithm is thus in $O((2d)^l \times (2d')^l \times |T| \times |T'|)$, with d and d' the maximum degrees of, respectively, T and T'.
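The count of possible values of path can be checked numerically for small d and l. The sketch below is a direct transcription of the product of sums given in the text and of the closed-form upper bound; the function names are hypothetical, and d is assumed to be at least 2 so that the bound is defined.

```python
from fractions import Fraction


def path_count(d: int, l: int) -> int:
    """2^l * prod_{k=1..l} sum_{j=1..k} d^j: number of values of path."""
    total = 2 ** l
    for k in range(1, l + 1):
        total *= sum(d ** j for j in range(1, k + 1))
    return total


def path_bound(d: int, l: int) -> Fraction:
    """Upper bound 2^l * (1 / (d - 1))^l * d^((l + 1)(l + 2) / 2), for d >= 2."""
    exponent = (l + 1) * (l + 2) // 2  # (l + 1)(l + 2) is always even
    return Fraction(2 ** l * d ** exponent, (d - 1) ** l)
```

With a single fusion allowed (l = 1) the count is simply 2d: one of d children, fused either by node or by edge. The count grows quickly with l, which is consistent with the recommendation to keep l small.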

    The computation of the time complexity of our algorithm is done in a similar way as for the algorithm of Zhang and Shasha. For each node of T and T', one must compute the number of subtree distance computations the node will be involved in by considering all subtrees rooted in, respectively, a node of $LR(T)$ and a node of $LR(T')$. In our case, one must also take into account for each node the possibility of applying a fusion. This leads to a time complexity in

    \[
    O\bigl((2d)^l \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^l \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr).
    \]

    This complexity suggests that the fusion operations may be used only for reasonable trees (typically, less than 100 nodes) and small values of l (typically, less than 4). It is however important to observe that the overall number of fusions one may perform can be much greater than l without affecting the worst-case complexity of the algorithm. Indeed, any number of fusions can be made while still retaining the bound of

    \[
    O\bigl((2d)^l \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr)
    \]

    so long as one does not realize more than l consecutive fusions for each node.

    In general, also, most interesting tree representations of an RNA are of small enough size as will be shown next, together with some initial results obtained in practice.

    5 APPLICATION TO RNA SECONDARY STRUCTURES COMPARISON

    The algorithm presented in the previous section has been coded using C++. An online version is available at http://www-igm.univ-mlv.fr/~allali/migal/.

    We recall that RNAs are relatively small molecules with

    sizes limited to a few kilobases. For instance, the small

    ribosomal subunit of Sulfolobus acidocaldarius (D14876) is

    made up of 1,147 bases. Using the representation shown in

    Fig. 2b, the tree obtained contains 440 internal nodes and

    567 leaves, that is 1,007 nodes overall. Using the representation
    in Fig. 2d, the tree is composed of 78 nodes. Finally, the tree obtained using the representation given in Fig. 2e

    contains only 48 nodes. We therefore see that even for large

    RNAs, any of the known abstract tree-representations (that

    is, representations which take the elements of the secondary

    structure of an RNA into account) that we can use leads to a

    tree of manageable size for our algorithm. In fact, for small

    values of l (2 or 3), the tree comparison takes reasonable

    time (a few minutes) and memory (less than 1Gb).

    As we already mentioned, a fusion (respectively, split) can

    be viewed as an alternative to a deletion (respectively,

    insertion) followed by a relabeling. Therefore, the cost

    function for a fusion must be chosen carefully.


    To simplify, we reason on the cost of a node fusion

    without considering the label of the edges leading to the

    nodes that are fusioned with a father. The formal definition

    of the cost functions takes the edges also into account.

    Let us assume that the cost function returns a real value between zero and one. If we want to compute the cost of a fusion between two nodes u and v, the aim is to give to such fusion a cost slightly greater than the cost of deleting v and relabeling u; that is, we wish to have $\mathrm{cost}_{\text{node fusion}}(u, v) = \min(\mathrm{cost}_{del}(v) + t, 1)$. The parameter t is a tuning parameter for the fusion.

    Suppose that the new node w resulting from the fusion of u and v matches with another node z. The cost of this match is $\mathrm{cost}_{match}(w, z)$. If we do not allow for node fusions, the algorithm will first match u with z, then will delete v. If we compare the two possibilities, on one hand we have a total cost of $\mathrm{cost}_{\text{node fusion}}(u, v) + \mathrm{cost}_{match}(w, z)$ for the fusion, that is, $\mathrm{cost}_{del}(v) + t + \mathrm{cost}_{match}(w, z)$; on the other hand, a cost of $\mathrm{cost}_{del}(v) + \mathrm{cost}_{match}(u, z)$. Thus, t represents the gain that must be obtained by $\mathrm{cost}_{match}(w, z)$ with regard to $\mathrm{cost}_{match}(u, z)$, that is, by a match without fusion. This is illustrated in Fig. 10.

    In this example, the cost associated with the path on the top is $\mathrm{cost}_{match}(5, 9) + \mathrm{cost}_{del}(3)$. The path at the bottom has a cost of $\mathrm{cost}_{\text{node fusion}}(5, 3) = \mathrm{cost}_{del}(3) + t$ for the node fusion, to which is added a relabeling cost of $\mathrm{cost}_{match}(8, 9)$, leading to a total of $\mathrm{cost}_{match}(8, 9) + \mathrm{cost}_{del}(3) + t$. A node fusion will therefore be chosen if $\mathrm{cost}_{match}(8, 9) + t < \mathrm{cost}_{match}(5, 9)$, therefore if the score of a match with fusion is better by at least t than a match without fusion.
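The comparison made in this example can be written as a small decision rule. The following sketch uses the simplified node-fusion cost of this passage (edge labels ignored); the function names and the numeric values in the usage note are hypothetical and chosen only to illustrate the role of t.

```python
def cost_node_fusion(cost_del_v: float, t: float) -> float:
    """Simplified fusion cost from the text: min(cost_del(v) + t, 1)."""
    return min(cost_del_v + t, 1.0)


def fusion_is_chosen(match_fused: float, match_plain: float,
                     cost_del_v: float, t: float) -> bool:
    """Compare 'fuse v into u, then match the fused node' with
    'match u, then delete v' and report whether the fusion is cheaper.

    match_fused: cost_match(w, z) for the node w resulting from the fusion
    match_plain: cost_match(u, z) for the node u without fusion
    """
    with_fusion = cost_node_fusion(cost_del_v, t) + match_fused
    without_fusion = cost_del_v + match_plain
    return with_fusion < without_fusion
```

For instance, with a deletion cost of 0.2, a fused match of 0.1 and a plain match of 0.3, the fusion is preferred for t = 0.05 (gain of 0.2 exceeds t) but not for t = 0.25.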

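    We apply the same reasoning to the cost of an edge fusion. The cost function for a node and an edge fusion between a node u and a node v, with $e_u$ denoting the edge leading to u and $e_v$ the edge leading to v, is defined as follows:

    \[
    \begin{aligned}
    \mathrm{cost}_{\text{node fusion}}(u, v) &= \mathrm{cost}_{del}(v) + \mathrm{cost}_{del}(e_v) + t \\
    \mathrm{cost}_{\text{edge fusion}}(u, v) &= \mathrm{cost}_{del}(u) + \mathrm{cost}_{del}(e_u) + t + \sum_{c \text{ sibling of } v} \mathrm{cost}(\text{deleting subtree rooted at } c).
    \end{aligned}
    \]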

    The tuning parameter t is thus an important parameter that allows us to control fusions. Always considering a cost function that produces real values between 0 and 1, if t is equal to 0.1, a fusion will be performed only if it improves the score by 0.1. In practice, we use values of t between 0 and 0.2.

    For practical considerations, we also set a further condition on the cost and relabeling functions related to a node or edge resulting from a fusion which is as follows:

    \[
    \mathrm{cost}_{del}(a) + \mathrm{cost}_{del}(b) \ge \mathrm{cost}_{del}(c)
    \]

    with c the label of the node/edge resulting from the fusion of the nodes/edges labeled a and b. Indeed, if this condition is not fulfilled, the algorithm may systematically fusion the nodes or edges to reduce the overall cost.

    An important consequence of the conditions seen above

    is that a node fusion cannot be followed by an edge fusion. Below, the node fusion followed by an edge fusion costs:

    \[
    \bigl(\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B) + t\bigr) + \bigl(\mathrm{cost}_{del}(AB) + \mathrm{cost}_{del}(a) + t\bigr).
    \]

    The alternative is to destroy node B (together with edge b) and then to operate an edge fusion, the whole costing: $\bigl(\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B)\bigr) + \bigl(\mathrm{cost}_{del}(A) + \mathrm{cost}_{del}(a) + t\bigr)$. The difference between these two costs is $t + \mathrm{cost}_{del}(AB) - \mathrm{cost}_{del}(A)$, which is always positive.

    This observation allows the performance of the algorithm to be significantly improved in practice.

    We have applied the new algorithm on the two RNAs

    shown in Fig. 5 (these are eukaryotic nuclear P RNAs from Saccharomyces uvarum and Saccharomyces kluveri) and coded using the same type of representation as in Fig. 2d. We have limited the number of consecutive fusions to one (l = 1). The computation of the edit distance between the two trees taking node and edge fusions into account besides deletions, insertions, and relabeling has required less than a second. The total cost allowing for fusions is 6.18 with t = 0.05 against 7.42 without fusions. As indicated in Fig. 11, the last two problems discussed in Section 3 disappear thanks to some edge fusions (represented by the boxes).

    An example of node fusions required when comparing two real RNAs is given in Fig. 12. The RNAs are coded using the same type of representation as in Fig. 2d. The figure shows part of the mapping obtained between the small subunits of two ribosomal RNAs retrieved from [8] (from Bacillaria paxillifer and Calicophoron calicophorum). The node fusion has been circled.


    Fig. 10. Illustration of the gain that must be obtained using a fusion

    instead of a deletion/relabeling.


    6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA

    We briefly discuss now an approach which addresses in

    part the scattering effect problem (see Section 2). This

    approach is being currently validated and will be more fully

    described in another paper. We therefore present here the

    main idea only.

    To start with, it is important to understand the nature of

    this scattering effect. Let us consider first a trivial case: the

    cost functions are unitary (insertion, deletion, and relabeling

    each cost 1) and we compute the edit distance between two

    trees composed of a single node each. The obtained mapping will associate the single node in the first tree with the single one in the second tree, independently from the labels of the nodes. This example can be extended to the comparison of two trees whose node labels are all different. In this case, the obtained mapping corresponds to the maximum homeomorphic subtree common to both trees.

    If the two RNA secondary structures compared using a tree representation which models both the base pairs and the nonpaired bases are globally similar but present some local dissimilarity, then an edit operation will almost always associate the nodes of the locally divergent regions that are located at the same positions relative to the global common structure. This is a normal, expected behavior in the context of an edition. However, it seems clear also when we look at Fig. 4 that the bases of a terminal loop should not be mapped to those of a multiple loop.

    To reduce this problem, one possible solution consists of

    adding to the nodes corresponding to a base an information

    concerning the element of secondary structure to which the

    base belongs. The cost functions are then adapted to take

    this type of information into account. This solution,

    although producing interesting results, is not entirely

    satisfying. Indeed, the algorithm will tend to systematically

    put into correspondence nodes (and, thus, bases) belonging

    to structural elements of the same type, which is also not

    necessarily a good choice as these elements may not be

    related in the overall structure. It seems therefore preferable

    to have a structural approach first, mapping initially the

    elements of secondary structure to each other and taking

    care of the nucleotides in a second step only.

    The approach we have elaborated may be briefly

    described as follows: Given two RNA secondary structures,

    the first step consists in coding the RNAs by trees of type c

    in Fig. 2 (nodes represent bulges or multiple, internal or


    Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.

    Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.


    terminal loops while edges code for helices). We then compute the edit distance between these two trees using the

    two novel fusion operations described in this paper. This

    also produces a mapping between the two trees. Each node

    and edge of the trees, that is, each element of secondary

    structure, is then colored according to this mapping. Two

    elements are thus of a same color if they have been mapped

    in the first step. We now have at our disposal an

    information concerning the structural similarity of the two

    RNAs. We can then code the RNAs using a tree of type b.

    In these trees, we add to each node the color of the

    structural element to which it belongs. We need now only to

    restrict the match operation to nodes of the same color. Two

    nodes can therefore match only if they belong to secondary

    elements that have been identified in the first step as being

    similar.

    To illustrate the use of this algorithm, we have applied it to the two RNAs of Fig. 4. Fig. 13 presents the trees of the type in Fig. 2c coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.

    This approach allows interesting results to be obtained.

    Furthermore, it considerably reduces the complexity of

    the algorithm for comparing two RNA structures coded

    with trees of the type in Fig. 2b. However, it is important to

    observe that the scattering effect problem is not specific to

    the tree representations of the type in Fig. 2b. Indeed, the

    same problem may be observed, to a lesser degree, with

    trees of the type in Fig. 2c. This is the reason why we

    generalize the process by adopting a modelling of RNA

    secondary structures at different levels of abstraction. This

    model, and the accompanying algorithm for comparing

    RNA structures, is in progress.
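The second step of the multilevel approach, restricting matches to nodes of the same color, can be sketched as a wrapper around an ordinary match cost. Everything named here is an assumption for illustration: the function name, the use of `None` for nodes whose structural element was left unmapped (hence uncolored) in the first step, and the use of an infinite cost to forbid a match.

```python
import math
from typing import Callable, Hashable, Optional


def colored_match_cost(base_cost: Callable[[str, str], float],
                       label_a: str, color_a: Optional[Hashable],
                       label_b: str, color_b: Optional[Hashable]) -> float:
    """Cost of matching two base-level nodes under the color restriction.

    Nodes may match only if their structural elements received the same
    color in the first step; otherwise the match is forbidden, which is
    modeled here (by assumption) with an infinite cost."""
    if color_a is None or color_b is None or color_a != color_b:
        return math.inf
    return base_cost(label_a, label_b)
```

An edit-distance routine that takes the minimum over deletion, insertion and match would then never select a match between differently colored nodes, since deletion and insertion remain finite alternatives.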

    7 FURTHER WORK AND CONCLUSION

    We have proposed an algorithm that addresses two main

    limitations of the classical tree edit operations for compar-

    ing RNA secondary structures. Its complexity is high in

    theory if many fusions are applied in succession to any

    given (the same) node, but the total number of fusions that

    may be performed is not limited. In practice, the algorithm
    is fast enough for most situations one is likely to meet.

    To provide a more complete solution to the problem of

    the scattering effect, we also proposed a new multilevel

    approach for comparing two RNA secondary structures

    whose main idea was sketched in this paper. Further details

    and evaluation of such a novel comparison scheme will be the subject of another paper.

    REFERENCES

    [1] D. Bouthinon and H. Soldano, "A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences," Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.
    [2] J.W. Brown, "The Ribonuclease P Database," Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.
    [3] N. el Mabrouk and F. Lisacek, "Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome," J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.
    [4] I. Hofacker, "The Vienna RNA Secondary Structure Server," 2003.
    [5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, "Fast Folding and Comparison of RNA Secondary Structures," Monatshefte fur Chemie, vol. 125, pp. 167-188, 1994.
    [6] M. Hochsmann, T. Toller, R. Giegerich, and S. Kurtz, "Local Similarity in RNA Secondary Structures," Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.
    [7] M. Hochsmann, B. Voss, and R. Giegerich, "Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.
    [8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, "The European Database on Small Subunit Ribosomal RNA," Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.
    [9] T. Jiang, L. Wang, and K. Zhang, "Alignment of Trees - An Alternative to Tree Edit," Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.
    [10] F. Lisacek, Y. Diaz, and F. Michel, "Automatic Identification of Group I Intron Cores in Genomic DNA Sequences," J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.


    Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting

    from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for

    hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.


    [11] B. Shapiro, "An Algorithm for Multiple RNA Secondary Structures," Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.
    [12] B.A. Shapiro and K. Zhang, "Comparing Multiple RNA Secondary Structures Using Tree Comparisons," Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.
    [13] K.-C. Tai, "The Tree-to-Tree Correction Problem," J. ACM, vol. 26, no. 3, pp. 422-433, 1979.
    [14] K. Zhang and D. Shasha, "Simple Fast Algorithms for the Editing Distance between Trees and Related Problems," SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.
    [15] M. Zuker, "Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction," Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.

    Julien Allali studied at the University of Marne-la-Vallee (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne-la-Vallee. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.

    Marie-France Sagot received the BSc degree in computer science from the University of Sao Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallee, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.


    Topological Rearrangements and Local Search Method for Tandem Duplication Trees

    Denis Bertrand and Olivier Gascuel

    Abstract: The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch

    [4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing

    numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication

    trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR,

    TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree

    Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these

    restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is

    applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all

    existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to

    tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any

    other program.

    Index Terms: Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc

    finger genes.

    1 INTRODUCTION

    REPEATED sequences constitute an important fraction of most genomes, from the well-studied Escherichia coli bacterial genome [1] to the Human genome [2]. For

    example, it is estimated that more than 50 percent of the

    Human genome consists of repeated sequences [2], [3].

    There exist three major types of repeated sequences:

    transposon-derived repeats, micro or minisatellites, and

    large duplicated sequences, the last often containing one or several RNA or protein-coding genes. Micro or minisatellites

    arise through a mechanism called slipped-strand

    mispairing, and are always arranged in tandem: copies of

    the same basic unit are linearly ordered on the chromosome.

    Large duplicated sequences are also often found in tandem

    and, when this is the case, unequal recombination is widely

    assumed to be responsible for their formation.

    Both the linear order among tandemly repeated se-

    quences, and the knowledge of the biological mechanisms

    responsible for their generation, suggest a simple model of

    evolution by duplication. This model, first described by

    Fitch in 1977 [4], introduces tandem duplication trees as

    phylogenies constrained by the unequal recombination

    mechanism. Although it is a completely different biological

    mechanism, slipped-strand mispairing leads to the same

    duplication model [5]. A formal recursive definition of this

    model is provided in Section 2, but its main features can be

    grasped from the examples of Fig. 1. Fig. 1a shows the

    duplication history of the 13 Antennapedia-class homeobox

    genes from the cognate group [6]. In this history, the

    ancestral locus has undergone a series of simple duplica-

    tion events where one of the genes has been duplicated into

    two adjacent copies. Starting from the unique ancestral

    gene, this series of events has produced the extant locus

    containing the 13 linearly ordered contemporary genes. It is

    easily seen [7] that trees only containing simple duplication

    events are equivalent to binary search trees with labeled

    leaves. They differ from standard phylogenies in that node

    children have left/right orientation. Fig. 1b shows another

    example corresponding to the nine variable genes of the

    human T cell receptor Gamma (TRGV) locus [8]. In this

    history, the most recent event involves a double duplica-

    tion where two adjacent genes have been simultaneously

    duplicated to produce four adjacent copies. Duplication

    trees containing multiple duplication events differ from

    binary search trees, but are less general than phylogenies.

    The model proposed by Fitch [4] covers both simple and

    multiple duplication trees.

    Fitch's paper [4] received relatively little attention at the

    time of its publication probably due to the lack of available

    sequence data. Rediscovered by Benson and Dong [9],

    Tang et al. [10], and Elemento et al. [8], tandemly repeated

    sequences and their suggested duplication model have

    recently received much interest, providing several new

    computational biology problems and challenges [11], [12].

    The main challenge consists of creating algorithms

    incorporating the model constraints to reconstruct the

    IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 15

    The authors are with Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS-Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5, France. E-mail: [email protected].

    Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.

    For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0170-1004.

    1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM

    duplication history of tandemly repeated sequences.

    Indeed, accurate reconstruction of duplication histories

    will be useful to elucidate various aspects of genome

    evolution. They will provide new insights into the

    mechanisms and determinants of gene and protein domain

    duplication, often recognized as major generators of

    novelty [13]. Several important gene families, such as

    immunity-related genes, are arranged in tandem; better

    understanding their evolution should provide new insights

    into their duplication dynamics and clues about their

    functional specialization. Studying the evolution of micro

    and minisatellites could resolve unanswered biological questions regarding human migrations or the evolution of

    bacterial diseases [14].

    Given a set of aligned and ordered sequences (DNA or

    proteins), the aim is to find the duplication tree that best

    explains these sequences, according to usual criteria in

    phylogenetics, e.g., parsimony or minimum evolution. Few

    studies have focused on the computational hardness of this

    problem, and all of these studies only deal with the

    restricted version where simultaneous duplication of multi-

    ple adjacent segments is not allowed. In this context, Jaitly

    et al. [15] show that finding the optimal single copy

    duplication tree with parsimony is NP-Hard and that this

    problem has a PTAS (Polynomial Time Approximation

    Scheme). Another closely related PTAS is given by Tang

    et al. [10] for the same problem. On the other hand,

    Elemento et al. [7] describe a polynomial distance-based

    algorithm that reconstructs optimal single copy tandem

    duplication trees with minimum evolution.

    However, it is commonly believed, as in phylogeny, that

    most (especially multiple) duplication tree inference pro-

    blems are NP-Hard. This explains the development of

    heuristic approaches. Benson and Dong [9] provide various

    parsimony-based heuristic reconstruction algorithms to infer

    duplication trees, especially from minisatellites. Elemento

    et al. [8] present an enumerative algorithm that computes the

    most parsimonious duplication tree; this algorithm (by its

    exhaustive approach) is limited to datasets of fewer than 15

    repeats. Several distance-based methods have also been

    described. The WINDOW method [10] uses an agglomeration

    scheme similar to UPGMA [16] and NJ [17], but the cost

    function used to judge potential duplication is based on the

    assumption that the sequences follow a molecular clock mode

    of evolution. The DTSCORE method [18] uses the same

    scheme but corrects this limitation using a score criterion [19],

    like ADDTREE [20]. DTSCORE can be used with sequences

    that do not follow the molecular clock, which is, for example,

    essential when dealing with gene families containing

    pseudogenes that evolve much faster than functional genes.

    Finally, GREEDY SEARCH [21] corresponds to a different

    approach divided into two steps: First, a phylogeny is

    computed with a classical reconstruction method (NJ), then,

    with nearest neighbor interchange (NNI) rearrangements, a

    duplication tree close to this phylogeny is computed. This

    approach is noteworthy since it implements topological

    rearrangements which are highly useful in phylogenetics

    [22], but it works blindly and does not ensure that good duplication trees will be found (cf. Section 5.2).

    Topological rearrangements have an essential function in

    phylogenetic inference, where they are used to improve an

    initial phylogeny by subtree movement or exchange.

    Rearrangements are very useful for all common criteria

    (parsimony, distance, maximum likelihood) and are inte-

    grated into all classical programs like PAUP* [23] or

    PHYLIP [24]. Furthermore, they are used to define various

    distances between phylogenies and are the foundation of

    much mathematical work [25]. Unfortunately, they cannot

    be directly used here, as shown by a simple example given

    Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].

    (b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In

    both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.

    later. Indeed, when applied to a duplication tree, they do

    not guarantee that another valid duplication tree will be

    produced.

    In this paper, we describe a set of topological rearrange-

    ments to stay inside the duplication tree space and explore

    the whole space from any of its elements. We then show the

    advantages of this approach for duplication tree inference

    from sequences. In Section 2, we describe the duplication

    model introduced by [4], [8], [10], as well as an algorithm to

    recognize duplication trees in linear time. Thanks to this

    algorithm, we restrict the neighborhoods defined by

    classical phylogeny rearrangements, namely, nearest neigh-

    bor interchange (NNI) and subtree pruning and regrafting

    (SPR), to valid duplication trees. We demonstrate (Section 3)

    that for NNI moves this restricted neighborhood does not

    allow the exploration of the whole duplication tree space.

    On the other hand, we demonstrate that the restricted

    neighborhood of SPR rearrangement allows the whole

    space to be explored. In this way, we define a local search

    method, applied here to parsimony and minimum evolu-

    tion (Section 4). We compare this method to other existing

    approaches using simulated and real data sets (Section 5).

    We conclude by discussing the positive results obtained by

    our method, and indicate directions for further research

    (Section 6).

    2 MODEL

    2.1 Duplication History and Duplication Tree

    The tandem duplication model used in this article was first

    introduced by Fitch [4] then studied independently by [8],

    [10]. It is based on unequal recombination which is assumed

    to be the sole evolution mechanism (except point mutations)

    acting on sequences. Although it is a completely different

    biological mechanism, slipped-strand mispairing leads to

    the same duplication model [5], [9].

    Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences

    representing the extant locus. Initially containing a single

    copy, the locus grew through a series of consecutive

    duplications. As shown in Fig. 2a, a duplication history

    may contain simple duplication events. When the dupli-

    cated fragment contains two, three, or k repeats, we say that

    it involves a multiple duplication event. Under this

    duplication model, a duplication history is a rooted tree

    with n labeled and ordered leaves, in which internal nodes

    of degree 3 correspond to duplication events. In a real

    duplication history (Fig. 2a), the time intervals between

    consecutive duplications are completely known, and the

    internal nodes are ordered from top to bottom according to

    the moment they occurred in the course of evolution. Any ordered segment set of the same height then represents an

    ancestral state of the locus. We call such a set a floor, and

    we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a

    floor where $i$ and $j$ are consecutive and $i$ is on the left of $j$.

    However, in the absence of a molecular clock mode of

    evolution (a typical problem), it is impossible to recover the

    order between the duplication events of two different

    lineages from the sequences. In this case, we are only able to

    infer a duplication tree (DT) (Fig. 2b) or a rooted

    duplication tree (RDT) (Fig. 2c).

    A duplication tree is an unrooted phylogeny with

    ordered leaves, whose topology is compatible with at least

    one duplication history. Also, internal nodes of duplication

    trees are partitioned into "events" (or "blocks" following

    [10]), each containing one or more (ordered) nodes. We

    distinguish simple duplication events that contain a

    unique internal node (e.g., b and f in Fig. 2c) and multiple

    duplication events which group a series of adjacent and

    simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let

    $E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal

    nodes $s_i, s_{i+1}, \ldots, s_k$ in left to right order. We say that two

    consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$)

    just like in histories, as any event belongs to a floor in all of

    BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 17

    Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the

    possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position 1 on (b).

    the histories that are compatible with the DT being

    considered. The same notation will also be used for leaves

    to express the segment order in the extant locus. When the

    tree is rooted, every internal node $s_j$ is unambiguously

    associated to one parent and two child nodes; moreover,

    one child of $s_j$ is left and the other one is right, which is

    denoted as $l_j$ and $r_j$, respectively. In this case, for any duplication history that is compatible with this tree, child

    nodes of an event $(s_i, s_{i+1}, \ldots, s_k)$ are organized as follows:

    $$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$

    In [8], [26], [27], it was shown that rooting a

    duplication tree is different from rooting a phylogeny:

    the root of a duplication tree necessarily lies on the tree

    path between the most distant repeats on the locus, i.e., 1

    and n; moreover, the root is always located above all

    multiple duplications, e.g., Fig. 1b shows that there are

    only three valid root positions, the root cannot be a direct

    ancestor of 12.

    2.2 Recursive Definition of Rooted and Unrooted Duplication Trees

    A duplication tree is compatible with at least one duplica-

    tion history. This suggests a recursive definition, which

    progressively reconstructs a possible history, given a

    phylogeny T and a leaf ordering O. We define a cherry

    $(l, s, r)$ as a pair of leaves ($l$ and $r$) separated by a single

    node $s$ in $T$, and we call $C(T)$ the set of cherries of $T$. This

    recursive definition reverses evolution: It searches for a

    visible duplication event, agglomerates this event, and

    checks whether the reduced tree is a duplication tree. In the case of rooted trees, we have:

    $(T, O)$ defines a duplication tree with root $\rho$ if and only if:

    1. $(T, O)$ only contains $\rho$, or

    2. there is in $C(T)$ a series of cherries $(l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), \ldots, (l_k, s_k, r_k)$ with $k \geq i$ and $l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k$ in $O$, such that $(T', O')$ defines a duplication tree with root $\rho$, where $T'$ is obtained from $T$ by removing $l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k$, and $O'$ is obtained by replacing $(l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k)$ by $(s_i, s_{i+1}, \ldots, s_k)$ in $O$.

    The definition for unrooted trees is quite similar:

    $(T, O)$ defines an unrooted duplication tree if and only if:

    1. $(T, O)$ contains 1 segment, or

    2. same as for rooted trees with $(T', O')$ now defining an unrooted duplication tree.

    Those definitions provide a recursive algorithm, RADT

    (Recognition Algorithm for Duplication Trees), to check

    whether any given phylogeny with ordered leaves is a

    duplication tree. In case of success, this algorithm can also

    be used to reconstruct duplication events: At each step, the

    series of internal nodes above denoted as $(s_i, s_{i+1}, \ldots, s_k)$ is

    a duplication event. When the tree is rooted, $l_j$ is the left

    child of $s_j$ and $r_j$ its right child, for every $j$, $i \leq j \leq k$. This

    algorithm can be implemented in $O(n)$ [26] where $n$ is the

    number of leaves. Another linear algorithm is proposed by

    Zhang et al. [21] using a top down approach instead of a

    bottom-up one, but applies only to rooted duplication trees.
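
    The recursive definition above can be turned into a short greedy check. The following Python sketch is not the linear-time implementation of [26]: it is a quadratic illustration for rooted trees only. The data layout (a `children` dictionary mapping each internal node to its unordered pair of children, and a list `order` for O) and the name `recognize_rdt` are assumptions made for this example. It also assumes that T is a rooted binary tree whose leaves are exactly the elements of `order` and, as the description of RADT suggests, that any visible event may be agglomerated first.

```python
def recognize_rdt(children, order):
    """Return the list of duplication events if (T, O) is a rooted
    duplication tree, or None otherwise.

    children: dict mapping each internal node to its two children
              (the pair is unordered; left/right is inferred from O).
    order:    list of leaves from left to right (the ordering O).
    """
    parent = {c: p for p, kids in children.items() for c in kids}

    def sibling(node):
        a, b = children[parent[node]]
        return b if a == node else a

    current = list(order)      # current floor, initially the extant locus
    events = []
    while len(current) > 1:
        pos = {node: idx for idx, node in enumerate(current)}
        found = None
        for p, left in enumerate(current):
            q = pos.get(sibling(left))
            if q is None or q <= p:
                continue       # sibling not visible yet, or `left` is a right child
            m = q - p          # candidate event size
            if q + m > len(current):
                continue
            # l_i, ..., l_k, r_i, ..., r_k must occupy positions p .. p+2m-1
            if all(pos.get(sibling(current[p + j])) == q + j for j in range(m)):
                found = (p, m)
                break
        if found is None:
            return None        # no visible event: not a duplication tree
        p, m = found
        event = [parent[current[p + j]] for j in range(m)]
        current[p:p + 2 * m] = event   # agglomerate the event
        events.append(event)
    return events
```

    With `children = {"c": (1, 3), "d": (2, 4), "r": ("c", "d")}` and `order = [1, 2, 3, 4]`, the sketch first agglomerates the double duplication event made of c and d, then the simple event made of r. Because a one-leaf tree yields an empty event list, callers should test the result with `is not None` rather than by truthiness.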

    3 TOPOLOGICAL REARRANGEMENTS FOR DUPLICATION TREES

    This section shows how to explore the DT space using SPR

    rearrangements. First, we describe some NNI, SPR, and

    TBR rearrangement properties with standard phylogenies.

    But, these rearrangements cannot be directly used to

    explore the DT space. Indeed, when applied to a duplica-

    tion tree, they do not guarantee that another valid

    duplication tree will be produced. So, we have decided to

    restrict the neighborhood defined by those rearrangements

    to duplication trees. If we only used NNI rearrangements,

    the neighborhood would be too restricted (as shown by a

    simple example) and would not allow the whole DT space

    to be explored. On the other hand, we can distinguish two

    types of SPR rearrangements which, when applied to a

    rooted duplication tree, guarantee that another valid

    duplication tree will be produced. Thanks to these specific

    rearrangements, we demonstrate that restricting the neigh-

    borhood of SPR rearrangements allows the whole space of

    duplication trees to be explored.

    Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: T whose RT is a rooted version; T' is obtained by

    applying NNI(5,4) around the bold edge; none of the possible root positions of T' (a, b, c, and d) leads to a valid RDT, cf. tree (b) which

    corresponds to root b in T'.

    3.1 Topological Rearrangements for Phylogeny

    There are many ways of carrying out topological rearrange-

    ments on phylogeny [22]. We only describe NNI (Nearest

    Neighbor Interchange), SPR (Subtree Pruning Regrafting),

    and TBR (Tree Bisection and Reconnection) rearrangements.

    The NNI move is a simple rearrangement which

    exchanges two subtrees adjacent to the same internal edge (Figs. 3 and 4). There are two possible NNIs for each

    internal edge, so $2(n-3)$ neighboring trees for one tree

    with $n$ leaves. This rearrangement allows the whole space of

    phylogeny to be explored; i.e., there is a succession of NNI

    moves making it possible to transform any phylogeny $P_1$ into any phylogeny $P_2$ [28].
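
    As an illustration of the NNI neighborhood, the sketch below enumerates the NNI neighbors of an unrooted binary phylogeny: for each internal edge, one subtree on one side is exchanged with each of the two subtrees on the other side. The adjacency-dictionary representation and the name `nni_neighbors` are choices made for this example only.

```python
def nni_neighbors(adj):
    """Enumerate the NNI neighbors of an unrooted binary tree.

    adj: dict mapping each node to the set of its neighbors
         (leaves have degree 1, internal nodes degree 3).
    Returns a list of new adjacency dicts; `adj` is not modified.
    """
    result = []
    seen_edges = set()
    for u in adj:
        if len(adj[u]) != 3:
            continue                       # u is a leaf
        for v in adj[u]:
            if len(adj[v]) != 3 or (v, u) in seen_edges:
                continue                   # not an internal edge, or already done
            seen_edges.add((u, v))
            # b is one of the two subtrees hanging from u (away from v)
            _, b = sorted(adj[u] - {v}, key=str)
            for c in sorted(adj[v] - {u}, key=str):
                new = {x: set(ns) for x, ns in adj.items()}
                # exchange subtree b (attached to u) with subtree c (attached to v)
                new[u].remove(b); new[u].add(c)
                new[v].remove(c); new[v].add(b)
                new[b].remove(u); new[b].add(v)
                new[c].remove(v); new[c].add(u)
                result.append(new)
    return result
```

    On a five-leaf tree, which has two internal edges, this returns four neighbors, in agreement with the count of two NNIs per internal edge.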

    The SPR move consists of pruning a subtree and

    regrafting it, by its root, to an edge of the resulting tree

    (Figs. 6 and 7). We note that the neighborhood of a tree

    defined by the NNI rearrangements is included in the

    neighborhood defined by SPRs. The latter rearrangement

    defines a neighborhood of size $2(n-3)(2n-7)$ [25].

    Finally, TBR generalizes SPR by allowing the pruned

    subtree to be reconnected by any of its edges to the resulting

    tree. These three rearrangements (NNI, SPR, and TBR) are

    reversible, that is, if $T'$ is obtained from $T$ by a particular

    rearrangement, then $T$ can be obtained from $T'$ using the

    same type of rearrangement.
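
    As a quick check of the two neighborhood sizes for standard phylogenies, $2(n-3)$ for NNI and $2(n-3)(2n-7)$ for SPR, the following worked arithmetic evaluates them for $n = 6$ and $n = 13$ leaves. These values of $n$ are chosen here only for illustration (13 is the number of genes in Fig. 1a); the restricted neighborhoods discussed next are subsets of these.

```latex
\begin{align*}
2(n-3)\big|_{n=6}  &= 6,  & 2(n-3)(2n-7)\big|_{n=6}  &= 2 \cdot 3 \cdot 5 = 30,\\
2(n-3)\big|_{n=13} &= 20, & 2(n-3)(2n-7)\big|_{n=13} &= 2 \cdot 10 \cdot 19 = 380.
\end{align*}
```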

    3.2 NNI Rearrangements Do Not Stay in DT Space

    The classical phylogenetic rearrangements (NNI, SPR,

    TBR,...) do not always stay in DT space. So, if we apply

    an NNI to a DT (e.g., Fig. 3), the resulting tree is not always

    a valid DT. This property is also true for SPR and TBR rearrangements since NNI rearrangements are included in

    these two rearrangement classes.

    3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored

    To restrict the neighborhood defined by NNI rearrange-

    ments to duplication trees, each element of the neighbor-

    hood is filtered thanks to the recognition algorithm (RADT).

    But, this restricted neighborhood does not allow the whole

    DT space to be explored. Fig. 4 gives an example of a

    duplication tree, T, the neighborhood of which does not

    contain any DT. So, its restricted neighborhood is empty,

    and there is no succession of restricted NNIs allowing T to

    be transformed into any other DT.
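
    Whether a restricted neighborhood allows a space to be explored is a connectivity question, which can be checked on small instances by a breadth-first search. In the sketch below, `neighbors` (an NNI or SPR neighborhood function) and `is_duplication_tree` (a recognition test such as RADT) are hypothetical callables supplied by the caller, and trees are assumed to be hashable canonical encodings; none of these names come from the paper.

```python
from collections import deque

def reachable(start, neighbors, is_duplication_tree):
    """Set of trees reachable from `start` using only moves whose result
    passes the recognition test (i.e., restricted moves)."""
    seen = {start}
    queue = deque([start])
    while queue:
        tree = queue.popleft()
        for cand in neighbors(tree):
            # keep only unseen candidates that are valid duplication trees
            if cand not in seen and is_duplication_tree(cand):
                seen.add(cand)
                queue.append(cand)
    return seen
```

    A tree whose restricted neighborhood is empty, like T in Fig. 4, gives a result containing only T itself; a neighborhood explores the whole space from T only if the returned set contains every duplication tree.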

    3.4 Restricted SPR Allows the Whole DT Space to Be Explored

    As before, we restrict (using RADT) the neighborhood

    defined by SPR rearrangements to duplication trees. We

    call "restricted SPRs" those SPR moves that, starting from a

    duplication tree, lead to another duplication tree.

    Main Theorem. Let T1 and T2 be any given duplication trees; T1

    can be transformed into T2 via a succession of restricted SPRs.

    Proof. To demonstrate the Main Theorem, we define two

    types of special SPR that ensure staying within the space

    of rooted duplication trees (RDT). Given these two types

    of SPRs, we demonstrate that it is possible to transform

    any rooted duplication tree into a caterpillar, i.e., a

    rooted tree in which all internal nodes belong to the tree

    path between leaf 1 and the tree root (cf. Fig. 5).

    This result demonstrates the theorem. Indeed, let T1 and T2 be two RDTs. We can transform T1 and T2 into a

    caterpillar by a succession of restricted SPRs. So, it is

    possible to transform T1 into T2 by a succession of

    restricted SPRs, with (possibly) a caterpillar as inter-

    mediate tree. This property holds since the reciprocal

    movement of an SPR is an SPR. As the two SPR types

    proposed ensure that we stay within the RDT space, we

    have the desired result for rooted duplication trees. And,

    this result extends to unrooted duplication trees since

    two DTs can be arbitrarily rooted, transformed from one

    to the other using restricted SPRs, then unrooted. □
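
    The argument above is constructive: two sequences of restricted SPRs ending at the same caterpillar can be glued into one sequence, because every SPR can be undone by an SPR. A minimal sketch of that gluing step follows, assuming each sequence is given as the list of trees it visits; the function name and the list representation are illustrative only.

```python
def path_via_caterpillar(path_1, path_2):
    """path_1 = [T1, ..., C] and path_2 = [T2, ..., C] list the trees visited
    while transforming T1 and T2 into the same caterpillar C. Returns the
    trees visited when going from T1 to T2 through C."""
    if path_1[-1] != path_2[-1]:
        raise ValueError("both paths must end at the same caterpillar")
    # Walk path_1 forward, then path_2 backward (skipping the shared C).
    return path_1 + path_2[-2::-1]
```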

    The first special SPR allows multiple duplication

    events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a

    duplication event, $r_i$ and $l_k$ respectively right child of $s_i$

    Fig. 5. A six-leaf caterpillar.

    Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: T whose RT is a rooted version; T' is obtained by

    exchanging subtrees 1 and (2 5); none of the possible root positions of T' (a, b, and c) leads to a valid duplication tree, cf. tree (b) which corresponds

    to root b in T'; and the same holds for every neighbor of T being obtained by NNI.

    and left child of $s_k$, and let $p_i$ be the father of $s_i$. The

    DELETE rearrangement consists of pruning the subtree of

    root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while

    $l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6

    demonstrates this rearrangement.
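
    One possible reading of the DELETE rearrangement can be sketched in Python as follows. This is an interpretation of the textual description only (Fig. 6 is not reproduced here), and the representation is an assumption: `children` maps each internal node to its (left, right) children. Instead of renaming $l_i$ to $s_i$ as in the text, the sketch removes $s_i$ and attaches $l_i$ directly to $p_i$, and it uses a caller-supplied label `new_node` (hypothetical) for the node created on the edge $(s_k, l_k)$; under this reading the resulting topology is the same.

```python
def delete_move(children, event, new_node):
    """Apply DELETE to a multiple duplication event of a rooted tree.

    children: dict mapping each internal node to its (left, right) children.
    event:    [s_i, ..., s_k] in left-to-right order, with at least two nodes.
    new_node: unused label for the node created on the edge (s_k, l_k).
    Returns a new dictionary; the input is left unchanged.
    """
    if len(event) < 2:
        raise ValueError("DELETE applies to multiple duplication events")
    s_i, s_k = event[0], event[-1]
    l_i, r_i = children[s_i]
    l_k, r_k = children[s_k]
    # p_i, the father of s_i (assumed to exist, as in the text)
    p_i = next(p for p, kids in children.items() if s_i in kids)
    result = dict(children)
    # Prune r_i: s_i is left with the single child l_i, so l_i takes its place.
    left, right = children[p_i]
    result[p_i] = (l_i, right) if left == s_i else (left, l_i)
    del result[s_i]
    # Regraft r_i on the edge (s_k, l_k): the new node has children l_k and r_i.
    result[new_node] = (l_k, r_i)
    result[s_k] = (new_node, r_k)
    return result
```

    For the four-leaf tree `{"r": ("c", "d"), "c": (1, 3), "d": (2, 4)}` with the double event made of c and d, the move yields `{"r": (1, "d"), "d": ("x", 4), "x": (2, 3)}` when `new_node` is "x": the multiple event is gone and only simple duplications remain.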

    Lemma 1. DELETE preserves the RDT property.

    Proof. Let $T$ be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$

    be an event of $T$, and $T'$ be the tree obtained from $T$ by

    applying DELETE to $E$ (Fig. 6b). Children of any node $s_j$ ($i \leq j \leq k$) are denoted $l_j$ and $r_j$.

    By definition, for any duplication history compatible

    with $T$ we have

    $$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$

    Thus, there is a way to partially agglomerate $T$ (using an

    RADT-like procedure) such that these nodes become

    leaves. The same agglomeration can be applied to $T'$ as

    only ancestors of the $l_j$s and $r_j$s are affected by DELETE.

    Now, 1) agglomerate the event $E$ of $T$, and 2) reduce $T'$

    by agglomerating the cherry $(l_k, r_i)$ and then agglomerating the