A new combinatorial approach to sequence comparson Sabrina Mantaci University of Palermo Joint work...

A new combinatorial approach to sequence comparson

Sabrina Mantaci University of Palermo

Joint work with Antonio Restivo, Giovanna Rosone and Marinella Sciortino

Comparing sequences• Most of the traditional methods for comparing

sequences are based on sequence alignment.• Sequence alignment just consider local mutation,

but not mutations that involve longer genomic sequences, like, for instance, segment rearrangements.

• New aligment-free methods, based on information theory and data compression, have recently been introduced, in order to better capture common features of genomic sequences, that reflect in common evolutionary and functional mechanisms.

In this talk:

• Recall some of the existing alignment-free comparison methods

• Introduce a new alignment free method based on the Burrows-Wheeler transformation.

First approach:k-tuple Count Euclidean Distance

Count the number of occurrences of the lexicographically sorted list of the length-k factors of u and v, dispose them in twovectors and , and compute the sum of the absolutedifferences

k=3

(

k|A|

i

vk

ukk inin(u,v)d

1

][][

ukn v

kn

u=abababaabbabbabbbaaabbabbabaabv=aababababbabbbabaaaababbbbbabb

vn3

un3

( 2,

1, 3,

2,

4

5,

5,

4,

3,

1,

6,

7,

5,

3,

1)

4)

1232121111),(3 vud

Second approach: Distances based on compression

This method is based on the intuition that the more similar two sequences are, the more effective their joint compression than their independent compression.Given a compressor C, the normalized compression distance is defined as

)}(),(max{)}(),(min{)}(),(min{),(

vCuCvCuCvuCuvCvuNCD

)()()()()()(

)()()()(

yzCxzCzCxyCyxCxyCxCxyCxCxxC

C is a normal compressor

Theorem: If C is normal then NCD is a metric

Third approach: block-edit distanceCharacters edit: character insertions, deletions, substitutionsBlock edit: block copying, deletion, relocation

Block-edit distance DB(x y) is the minimum number of character and block edit operations needed to transform x to y.

Theorem: The Block-edit distance problem is NP-complete

a b a c b b a c cx = a b a c b b c

y = a b a c b a c c

Distance based on Ziv-Merhav parsing

Recall the LZ77-parsing of a word x

x = a b a a b a b a a b a a b a b a a b a b c(x)=6

The ZM-parsing (Ziv-Merhav parsing) of y with respect to x is ZM(y|x)=(z1z2… zmzm+1… zn) where y= z1… zn and each zi is a factor of x. The value n is the complexity of y with respect to x, denoted by c(y|x).

x = b b a b b a b a b a by = a b b b a b a b b b a

c(y|x)=3

The Ziv-Merhav distance from a word x to a word y isDZM(x y) = c(y|x)-1

DZM(x y)=2

Our distance based on a parsing of the extended bwt

We introduce a notion of Burrows Wheeler transform extended to a multiset S of words

We define a distance of two words u and v based on a suitable parsing of the extended bwt of the pair (u,v).

This distance takes into account the common factors of u and v without any restriction on the size of such factors.

The Burrows-Wheeler Transform• The Burrows-Wheeler Transform (denoted by bwt) is a

wellknown transformation introduced in

[M. Burrows and D. Wheeler, A block sorting data compression algorithm, Technical report, DIGITAL System Research Center, 1994]

• bwt is a reversible transformation that produces a permutation bwt(w) of an input sequence w, defined over an ordered alphabet A, such that the transformed sequence is “easier” to compress than the original one.

• The bwt represents for instance the heart of the BZIP2 algorithm that has become the standard for lossless data compression.

The BWT and Data Compression

• P. Fenwick, "The Burrows-Wheeler Transform for Block Sorting Text Compression - Principles and Improvements", The Computer Journal, Vol. 39(9), pp 731-740, 1996. • G. Manzini, “An Analysis of the Burrows-Wheeler Transform” Journal of the ACM. Vol. 48, n. 3, (2001), 407-430. • P. Ferragina, R. Giancarlo, G. Manzini, M. Sciortino, “Boosting Textual Compression in Optimal Linear Time”, Journal of the ACM. Vol. 52 (2005), 688-713.• P. Ferragina, G. Manzini, “Indexing Compressed Text”, Journal of the ACM. Vol. 52 (2005), 552-581.

How does BWT work?BWT takes as input a text w of length n and produces:

A permutation of the characters in the text bwt(w) An index I

Example : w=“banana”, n=6

Mc

a b a n a na n a b a na n a n a bb a n a n an a b a n an a n a b a

bwt(w)= L = “nnbaaa”;

F L

I

I = 4

PROPERTIES• For each symbol a, the i-th

occurrence of a in F corresponds to the i-th occurrence of a in L;

• For each j = 1…n, jI, the symbol L[j] is followed in w by the symbol F[j];

• The first character of w is F[I].

F is deduced by lexicographically sorting the characters of L;

Construct the permutation such that for all j=1…n, if F[j] is the k-th occurrence of in F, then L[[j]] is the k-th occurrence of in L.

w[1] = F[I]=F[4]=b w[2] = F[]=F[3]=a w[3] = F[]=F[3]]=F[6]=n w[4] = F[]=F[2]]=F[2]=a w[5] = F[]=F[5]]=F[5]=n w[6] = F[]=F[3]]=F[1]=a

526341213654654321

bwt(w)= L = “nnbaaa”; I = 4

The inverse transform

Riga F L1 a n2 a n3 a b4 b a5 n a6 n a

The BWT and Combinatorics on Words

• Relationship with Sturmian words.[S. Mantaci, A. Restivo, M. Sciortino, “Burrows-Wheeler Transform and Sturmian Words”, Information Processing Letters, Vol. 86(5), 241-246 (2003).]• Relationship with combinatorics on permutations.

[M. Crochemore, J. Désarménien, D. Perrin, “A note on the Burrows-Wheeler Transformation”, Theoretical Computer Science, Vol. 332(1-3), 567-572 (2005).][M. Gessel, C. Reutenauer, “Counting permutations with given cycle structure and descent set”, J. Comb. Theory Ser. A, 64(2), 189-215, 1993]

A new order relation between words

For every word v, there exists a unique primitive word w and an integer k such that v=wk. By notation, w=root(v) and k=exp(v)

where u<lexvdenotes the usual lexicographic order betweeninfinite words.

u≤v exp(u)<exp(v) if root(u)=root(v)u<lexvotherwise

Proposition (Fine and Wilf): Given two primitive words u and v,

u≤v prefk(u)<lexprefk(v)where k=|u|+|v|-gcd(|u|,|v|).

The bound is tight: indeed abaababa≤abaab because

abaababaabab… abaababaabaa…

differ for the character in position 12=8+5-1.

Notice that the < order is different from the lexicographic one. For instance ab<lexaba, but aba≤ab , since abaabaaba… <lexababab…

The Extended Transform ebwtINPUT: S={abac, cbab, bca, cba}.

a b a c a b …a b c a b c …a b c b a b …a c a b a c …a c b a c b …b a b c b a …b a c a b a …b a c b a c …b c a b c a …b c b a b c …c a b a c a …c a b c a b …c b a b c b …c b a c b a …

1 a b a c 2 a b c 3 a b c b4 a c a b 5 a c b6 b a b c 7 b a c a8 b a c9 b c a10 b c b a11 c a b a12 c a b13 c b a b14 c b a

OUTPUT: (ccbbbcacaaabba, {1,9,13,14})

• Sort all the conjugates of the words in S by the order relation;• Consider the sequence of the sorted words and take the word S’ obtained by concatenating the last letter of each word;• Take the set I containing the positions of the words corresponding to the ones in S; • The output of the ebwt transformation is the pair (S’,I).

1 a b a c 2 a b c 3 a b c b4 a c a b 5 a c b6 b a b c 7 b a c a8 b a c9 b c a10 b c b a11 c a b a12 c a b13 c b a b14 c b a

Properties of the ebwt transformINPUT: S={abac, cbab, bca, cba}.

• In any row iI, the first symbol follows the last one, in one of the words in S. • For each character, the i-th occurrence in the first column corresponds to the i-th occurrence in the red column.• The first character of the word uj is the character F[Ij]

Inversion of the extended transform

1 a c

2 a c

3 a b

4 a b

5 a b

6 b c

7 b a

8 b c

9 b a

10 b a

11 c a

12 c b

13 c b

14 c a

F= a a a a a b b b b b c c c c

1 2 3 4 5 6 7 8 9 10 11 12 13 14

7 9 10 11 14 3 4 5 12 13 1 2 6 8

L= c c b b b c a c a a a b b a

=

S={abac,

= (1 7 4 11) a b a c

(2 9 12) a b c

(3 10 13 6) a b c b

(5 14 8) a c b

ebwt(S)=L=ccbbbcacaaabba I={1, 9, 13, 14}

F L

bca,cbab,cba}

Bijectivity

Theorem: For each word uA*, there exists a multiset SA*

such that ebwt(S)=(u,I) for some set I of indices. That is, ebwtis surjective.

For instance, bccaaab is not a bwt image of any word in A*, however ebwt(ab, abcac)=(bccaaab, {1,2}).If we don’t care about the indices we obtain the Gessel and Reutenauer Theorem.

Theorem: The transformation ebwt is injective.

THEOREM (Gessel-Reutenauer): There exists a bijection between A* and the family of multisets of conjugacy classes of primitive words in A*.

Relation between the bwt and its extension

• if S={u}, then ebwt(S)=bwt(u)

• Consider the words u=aaabbb and v=abaabbb. Thenebwt(u,v) = b b a b a a b a b b b b a a Notice that bwt(u)=baabbba and bwt(v)=baabbba

correspond to the blue and the red subsequences of ebwt(u,v), respectively.

Actually the ebwt of a multiset of strings S is one element of the shuffle of the bwt’s of the words in S.

The distance measures based on the ebwt considers “how much” the characters coming from u (the red ones) and from v (the blue ones) are shuffled in the ebwt(u,v).

Sequence comparison

u=CGGGCACACACTCGCTCACACACTG v=CCCGCTCGACGCACACACTGCACAT

ACACACTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACATCGTCCCACACACTGCCGTTCGCACACACTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACCACACACTGCCGTTCGATCGTC

CGTTGGGCCCGCGGGCACACACTCTCGTTGGGCCCGCGGGCACACACTGCCGTTCGATCGTCCCACACAGATCGTCCCACACACTGCCGTTC

TGGGCCCGCGGGCACACACTCGTTTCGATCGTCCCACACACTGCCGTTGGGCCCGCGGGCACACACTCG

The distance measure between words

Definition: if u=u1u2…un, then ui,j=uiui+1…uj. We say that ui,j is a monotonic block if all characters uk, with k=i,…,j are equal, and ui,j is maximal with respect to this property.

Let u and v be two words and let P=B1,B2,…,Bm a parsing of ebwt(u,v) into monotonic blocks.

Denote by the number of the characters coming from u in the monothonic block Bi. We define the distance measure

m

iii vcucvu

1

|)()(|),(

)(uci

Example

Consider the words u=aaabbb and v=abaabbb. Then

b b a b a a b a b b b b a a 0 1 1 0 1 1 0 0

8

1

4|)()(|),(i

ii vcucvu

|)()(| vcuc ii

),( vuebwt

Properties of the distance 1. (u,v)= (v,u), that is, the measure is symmetric.. (u,v)=0 if and only if u and v are conjugate.3. If u’ is a conjugate of u and v’ is a conjugate of v, then

(u,v)= (u’,v’). That is, is a distance between conjugacy classes.

4. If S={u1, u2, …, uk}, with k2, it is possible to compute the distance between all the pairs of sequences in S by applying ebwt just once to the whole multiset S. (see next slide)

does not satisfy the triangle inequality. (see next slide)

Multiple sequence comparisonConsider the words u=aaabbbb, v=ababbb, z=abababb

ebwt(u,v,z)=b b a b b b b a a b a b b a a b b a b a a

ebwt(u,v)=bbabaababbbbaa ebwt(u,z)=babbbabbaababa ebwt(v,z)=bbbbbaabbaabaa

(u,v)=4 (u,v)=12 (u,z)=6

Triangle inequality is not satisfied by this triplet

Multiple sequence comparisonIt is possible to compare all pairs of strings taken from a set S,by performing a unique transformation ebwt(S). In fact each transformation of a pair of words in S can be obtained from ebwt(S) by simply deleting all characters not belonging to the considered pair of words.

In this way, in order to find all pairwise distances of sequences of a set S of k words of length n, we just need to perform a single sorting of kn sequences of length n, instead than k2 sortings of 2n sequences of length n. The memory requirement is O(kn2).

We have applied this method in order to classify different animal species by comparing their mtDNA

An Application: Comparison of biological sequences

• The distance measure is an example of alignment free distance measure.

• It takes into account (like the other alignment free distances) “how many” factors the compared sequences share.

• The distance measures how dissimilar two conjugacy classes of sequences (or, analogously, circular words) are.

• In order to test our method we have applied the normalized version of our distance to the whole mitochondrial genome phylogeny. (remark that mithocondrial dna is circular by nature)

• The results we have obtained are very close to the ones derived, with other approaches in most of the papers in which the considered species are the same.

Evolutionary tree built from mtDNA sequences of 20 species

by using the neighbor-joining clustering method

primates

ferungulatesrodents

Eutheria

MetatheriaProrotheria

(Prototheria, Metatheria, (Rodents, (Primates, Ferungulates)))

Work in progress

• Comparison of our similarity measures with other methods for classifying biological sequences.

• Other criteria for comparing sequences by using the extended Burrows-Wheeler Transform

ReferencesMantaci, Restivo, Rosone, Sciortino. A new combinatorial approach to sequence comparison. Sibmitted to TOCS. Special issue for ICTCS06.

The extended Burrows-WheelerTransform

A new combinatorial approach to sequence comparson Sabrina Mantaci University of Palermo Joint work...

Documents

Transcript of A new combinatorial approach to sequence comparson Sabrina Mantaci University of Palermo Joint work...