Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were...

Multiple Sequence Alignment

Algorithms in Computational Biology, Spring 2006

Most of the slides were created by Dan Geiger and Ydo Wexler and edited by Itai Sharon; others were created by Itai Sharon

2

Multiple Sequence Alignment

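S1 = AGGTC

S2 = GTTCG

S3 = TGAAC

Possible alignment:

A G G T - C -

- G - T T C G

T G - A A C -

Possible alignment, given column by column (each column read as S1, S2, S3; column order not preserved): GTT, CCA, AG-, GTG, T-A, --A, -GC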

3

Motivation

Construction of phylogenetic trees: requires that the sites being compared are homologous

Extraction of conserved regions in proteins

Construction of profiles characteristic for a protein family

Repetitive sequences in DNA

4

Multiple Sequence Alignment (MSA)

Definition: Given strings s1, s2, …, sk, an MSA algorithm maps them to strings s’1, s’2, …, s’k that may contain gaps, where:

|s’1| = |s’2| = … = |s’k|

The removal of gaps from s’i leaves si

Note: It is usually convenient to represent an MSA as a matrix with k rows and |s’i| columns. No column may consist solely of gaps
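The definition above can be checked mechanically. The following is a minimal sketch, assuming gaps are written as "-" and the MSA is given as a list of row strings; the function name is_valid_msa is illustrative only.

```python
def is_valid_msa(sequences, rows, gap="-"):
    """Return True if `rows` is a valid MSA of `sequences` under the definition above."""
    if not rows or len(rows) != len(sequences):
        return False
    width = len(rows[0])
    # All rows must have the same length.
    if any(len(r) != width for r in rows):
        return False
    # Removing the gaps from row i must give back sequence i.
    if any(r.replace(gap, "") != s for r, s in zip(rows, sequences)):
        return False
    # No column may consist solely of gaps.
    return all(any(r[c] != gap for r in rows) for c in range(width))


seqs = ["AGGTC", "GTTCG", "TGAAC"]
rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
print(is_valid_msa(seqs, rows))
```

The rows used here are one alignment of the three example sequences S1, S2, S3.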

5

Assigning scores to an MSA

We will consider additive functions only

Points to consider regarding a scoring function: It should not depend on the order of its arguments. It should reward the presence of many equal or strongly related residues and penalize unrelated residues and spaces

In pairwise alignment the score is simply the sum of similarity scores of corresponding letters

What is the “best” way to measure the similarity of k>2 letters?

6

Sum of Pairs (SP)

The sum of pairs score of an MSA is the sum of the scores of all $\binom{k}{2}$ pairwise alignments induced by it

Example:

Using a cost function δ(x, x) = 0 and δ(x, y) = 1 for x ≠ y, this alignment has an SP value of 4 + 6 + 2 = 12:

- c - a d b -

a - b c d a d

a c - c d b -
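The SP value of the example can be recomputed with a few lines of code. This is a sketch under the slide's cost function (0 for equal symbols, 1 otherwise), where a gap opposite a gap also counts as equal; the names pair_cost and sp_cost are illustrative.

```python
from itertools import combinations


def pair_cost(x, y):
    """Cost of two aligned symbols: 0 if equal (including gap vs. gap), else 1."""
    return 0 if x == y else 1


def sp_cost(rows):
    """Sum the costs of all pairwise alignments induced by the MSA `rows`."""
    total = 0
    for a, b in combinations(rows, 2):
        total += sum(pair_cost(x, y) for x, y in zip(a, b))
    return total


example = ["-c-adb-", "a-bcdad", "ac-cdb-"]
print(sp_cost(example))  # 6 + 2 + 4 = 12
```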

7

Sum of Pairs

SP tends to overcount mutations. For instance: assume that our column consists of (A, A, A, T) and that δ(x, x) = 1, δ(x, y) = -1. The score for the column will be

3·δ(A, A) + 3·δ(A, T) = 3 – 3 = 0

while the column could be explained by a single A → T mutation
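The arithmetic for this column can be reproduced directly. A small sketch, assuming the match/mismatch values of this slide (+1 and -1); column_sp_score is an illustrative name.

```python
from itertools import combinations


def column_sp_score(column, match=1, mismatch=-1):
    """SP score of a single column: sum the score over all pairs of symbols."""
    return sum(match if x == y else mismatch for x, y in combinations(column, 2))


print(column_sp_score("AAAT"))  # 3 matching pairs and 3 mismatching pairs
print(column_sp_score("AAAA"))  # 6 matching pairs
```

One substituted letter changes 3 of the 6 pairs, so the column score drops from 6 to 0 even though only one sequence differs.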

8

How to Perform MSA?

Multidimensional dynamic programming

Tree alignments

Star alignments

Progressive alignment

9

Multidimensional DP Alignment

Given k strings of length n, there is a natural generalization of the DP algorithm

Instead of a 2-dimensional table, we now have a k-dimensional table to fill

For each cell V(i), i=(i1,.., ik), compute an optimal multiple alignment for the k prefix sequences s1(1,.., i1),..., sk(1,.., ik)

The adjacent cells are all cells V(i-b), where b ∈ {0,1}^k and b ≠ 0. Each cell depends on 2^k − 1 adjacent cells

Use the SP-score for computing the score
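The recurrence can be sketched as a memoized recursion over k-dimensional cells. This sketch minimizes an SP cost (0 for equal symbols, 1 otherwise, as in the SP example) rather than maximizing a score; that choice, and the names column_cost and optimal_sp_cost, are assumptions made for illustration. It is only practical for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"


def column_cost(column):
    """SP cost of one column: 0 for equal symbols (including two gaps), 1 otherwise."""
    return sum(0 if x == y else 1 for x, y in combinations(column, 2))


def optimal_sp_cost(seqs):
    """Minimum SP cost of a multiple alignment of `seqs` by k-dimensional DP."""
    k = len(seqs)
    # All non-zero vectors b in {0,1}^k: the 2^k - 1 offsets to adjacent cells.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    @lru_cache(maxsize=None)
    def V(cell):
        if not any(cell):
            return 0  # all prefixes are empty
        best = float("inf")
        for b in moves:
            prev = tuple(i - d for i, d in zip(cell, b))
            if min(prev) < 0:
                continue
            # Sequences with b = 1 contribute their next letter, the others a gap.
            column = [s[i - 1] if d else GAP for s, i, d in zip(seqs, cell, b)]
            best = min(best, V(prev) + column_cost(column))
        return best

    return V(tuple(len(s) for s in seqs))


print(optimal_sp_cost(["A", "A", "T"]))
```

For two sequences this reduces to the usual unit-cost edit distance.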

10

Multidimensional DP Alignment

What’s the price? Number of cells to fill: O(n^k). Number of dependencies of each cell: 2^k − 1. Time to compute the SP-score: O(k^2)

Complexity: O(k^2 2^k n^k)

In fact, the optimal SP-alignment problem was shown to be NP-complete!

Well, these sequences need to be aligned… what can we do?

11

Time Saving Heuristics – Relevance Tests

Idea: Avoid computing score(i) for irrelevant cells

Compute a lower bound L on the score of the optimal alignment. Any efficient approximation algorithm can be used

For each cell V(i) compute an upper bound U on the best alignment that goes through it

Ignore the cell if U<L

12


How do we compute the upper bound U for cell V(i)?

For cell i=(…,iu,…,iv,…) do the following: for each two indices 1 ≤ u < v ≤ k, compute the optimal score of a pairwise alignment of su and sv which goes via cell i, that is, through (iu, iv). Denote this score by score(s_u(1..i_u..n), s_v(1..i_v..n))

Compute

$$U = \sum_{1 \le u < v \le k} \mathrm{score}\big(s_u(1..i_u..n),\ s_v(1..i_v..n)\big)$$

Claim: U is an upper bound on the best MSA that goes through cell i
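The pairwise scores through a cell can be obtained by adding a prefix table to a suffix table. The sketch below assumes a simple scoring scheme (match +1, mismatch -1, gap -1) that is not given on the slides, and it recomputes the tables on every call for brevity; a real implementation would compute each pair's tables once. All function names are illustrative.

```python
from itertools import combinations


def score_table(a, b, match=1, mismatch=-1, gap=-1):
    """T[i][j] = optimal global alignment score of a[:i] and b[:j]."""
    T = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                options.append(T[i - 1][j - 1] + sub)
            if i:
                options.append(T[i - 1][j] + gap)
            if j:
                options.append(T[i][j - 1] + gap)
            T[i][j] = max(options)
    return T


def through_cell_table(a, b):
    """R[i][j] = best score of a global alignment of a and b passing through (i, j)."""
    fwd = score_table(a, b)
    # Suffix scores are prefix scores of the reversed strings.
    rev = score_table(a[::-1], b[::-1])
    return [[fwd[i][j] + rev[len(a) - i][len(b) - j] for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]


def upper_bound(seqs, cell):
    """U for cell = (i_1, ..., i_k): sum over all pairs of the through-cell scores."""
    total = 0
    for u, v in combinations(range(len(seqs)), 2):
        total += through_cell_table(seqs[u], seqs[v])[cell[u]][cell[v]]
    return total


seqs = ["AGGTC", "GTTCG", "TGAAC"]
print(upper_bound(seqs, (0, 0, 0)), upper_bound(seqs, (5, 0, 0)))
```

At the origin every pairwise alignment is unconstrained, so U there equals the sum of the optimal pairwise scores; no other cell can have a larger U.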

13


How do we compute the optimal route? Recall the space-efficient algorithm for pairwise alignment.

Can we go over all cells and determine whether they are relevant or not? No. Start with (0,…,0) and add relevant entries to the list until reaching (n1,…,nk)

What is the new time complexity? For each potential cell we’ve added O(k^2 n^2) operations. Depending on the quality of L, we’ve (hopefully) eliminated many cells

14

Tree Alignments – Structure

Input: A set of k sequences S = {s1, s2, …, sk}, and the topology of a tree T whose leaves are the members of S

Algorithm: Find an assignment of sequences to the interior nodes of the tree that optimizes the overall score. For each edge e=(vi,vj) of T, its weight w(e) is the pairwise alignment score of vi and vj

The overall score is defined by

$$\mathrm{score}(T) = \sum_{e \in T} w(e)$$

15

Tree Alignments – an Example

Suppose that we’re given the following tree:

[Figure: a tree with leaves CAT, GT, CTG and CG and interior nodes CT and CG; edge weights 1, 1, 2, 1, 1]

Given that δ(x, x)=1, δ(x, y)=0 and δ(x, -)=-1, the overall score of the alignment is

score(T)=2+3+1=6
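The total can be checked by scoring each edge with a pairwise alignment. The tree drawing is not recoverable from the text, so the topology below is an assumption: leaves CAT and GT hang off interior node CT, leaves CTG and CG hang off interior node CG, and the two interior nodes are joined. The names align_score and tree_score are illustrative.

```python
def align_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global pairwise alignment score (higher is better)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[len(b)]


# Assumed topology (not taken from the figure): each pair is one tree edge.
edges = [("CAT", "CT"), ("GT", "CT"),   # leaves attached to interior node CT
         ("CTG", "CG"), ("CG", "CG"),   # leaves attached to interior node CG
         ("CT", "CG")]                  # edge between the two interior nodes


def tree_score(edges):
    """Overall score: the sum of the pairwise alignment scores over all edges."""
    return sum(align_score(u, v) for u, v in edges)


print([align_score(u, v) for u, v in edges], tree_score(edges))
```

Under this topology the edge weights are 1, 1, 1, 2, 1, which group as 2 + 3 + 1 = 6.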

16

Tree Alignments – Notes

The MSA can be recovered from the alignments on the different edges

Overall score of the alignment is not SP

The tree alignment problem is NP-hard. There exists an algorithm that finds an optimal alignment in time exponential in the number of sequences

Tree alignment algorithms are applicable only when a tree topology is known

17

Star Alignments – Structure

Choose a sequence s* that will serve as the center of the star. How to choose: try all sequences, choose the one whose total distance from all the rest is the smallest, etc.

Add the other sequences by aligning them to s*. Add gaps to already aligned sequences when necessary. Never remove a gap (“Once a gap, always a gap”)

[Figure: a star of the sequences s1, …, s6]

18

The Center Star Method

Publication: Gusfield, 1993

Assumption: The cost function δ is a distance function that satisfies: δ(x, y) = δ(y, x) ≥ 0, δ(x, x) = 0, δ(x, z) + δ(z, y) ≥ δ(x, y)

Algorithm: Runs in polynomial time. The alignment’s score is less than twice the score of the optimal alignment

19

The Center Star Method – Definitions

Definitions: M – the alignment produced by the algorithm. M* – the best alignment, namely the one that gets the lowest score. d(i, j), d*(i, j) – the distance induced by M (respectively M*) on (si, sj)

DP(si, sj) – minimum pairwise alignment score. v(M) – score for alignment M:

$$v(M) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i, j)$$

Note that it is always true that d(i, j) ≥ DP(si, sj)

20

The Center Star Method

Input: A set of k sequences S = {s1, …, sk}

Algorithm: Find the center $s^* = \arg\min_{s \in S} \sum_{s_j \in S} DP(s, s_j)$. Suppose s* = s1

for i=2 to k do: Suppose s1, …, si-1 are already aligned as s’1, …, s’i-1. Align si against s’1 by running the DP algorithm to produce the alignment (s”1, s’i)

Adjust s’2, …, s’i-1 to s”1 by adding gaps to those columns where gaps were added to get s”1 from s’1.

Replace s’1 by s”1, add s’i.

end for
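The loop above can be sketched as follows. This is a minimal illustration, not Gusfield's original formulation: it assumes unit costs (0 for a match, 1 for a mismatch or a letter opposite a gap, 0 for a gap opposite a gap), and the names edit_distance, add_to_alignment and center_star are made up for the example.

```python
GAP = "-"


def edit_distance(a, b):
    """DP(a, b): minimum pairwise alignment cost under unit costs."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j - 1] + (a[i - 1] != b[j - 1]),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def add_to_alignment(rows, s):
    """Align s against rows[0] (the gapped center) and return the extended rows."""
    a, n, m = rows[0], len(rows[0]), len(s)

    def skip(ch):
        # Cost of putting a gap in the new row opposite ch; gap vs. gap is free.
        return 0 if ch == GAP else 1

    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + skip(a[i - 1])
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != s[j - 1]),
                          D[i - 1][j] + skip(a[i - 1]),
                          D[i][j - 1] + 1)
    # Trace back, building the columns of the new alignment in reverse order.
    old, new, i, j = [[] for _ in rows], [], n, m
    while i or j:
        if i and j and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != s[j - 1]):
            for r, out in zip(rows, old):
                out.append(r[i - 1])
            new.append(s[j - 1]); i -= 1; j -= 1
        elif i and D[i][j] == D[i - 1][j] + skip(a[i - 1]):
            for r, out in zip(rows, old):
                out.append(r[i - 1])
            new.append(GAP); i -= 1
        else:
            # New gap in the center: add it to every row already aligned.
            for out in old:
                out.append(GAP)
            new.append(s[j - 1]); j -= 1
    return ["".join(reversed(o)) for o in old] + ["".join(reversed(new))]


def center_star(seqs):
    """Return the rows of a center-star MSA, in the same order as `seqs`."""
    # The center minimizes the sum of pairwise distances to all other sequences.
    c = min(range(len(seqs)),
            key=lambda i: sum(edit_distance(seqs[i], t) for t in seqs))
    order = [c] + [i for i in range(len(seqs)) if i != c]
    rows = [seqs[c]]
    for i in order[1:]:
        rows = add_to_alignment(rows, seqs[i])
    result = [None] * len(seqs)
    for row, i in zip(rows, order):
        result[i] = row
    return result


for row in center_star(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Existing columns are only ever copied or padded with a new gap column, which is the "once a gap, always a gap" rule.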

21

The Center Star Method – Time Analysis

Choosing s*: running the DP algorithm $\binom{k}{2}$ times – O(k^2 n^2)

Adding s2, …, sk to the MSA: In step i the length of s’* is at most i·n. Aligning s’* with si takes O(i·n^2) time. Performing k-1 such alignments takes O(k^2 n^2) time:

$$\sum_{i=1}^{k-1} O(i \cdot n \cdot n) = O(n^2) \cdot \sum_{i=1}^{k-1} i = O(k^2 n^2)$$

Overall time complexity: O(k^2 n^2)

22

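The Center Star Method – Error Analysis

$$v(M) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i, j) \le \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \big[ d(i, 1) + d(1, j) \big] \quad \text{(triangle inequality)}$$

$$= 2(k-1) \sum_{i=2}^{k} d(1, i) = 2(k-1) \sum_{i=2}^{k} DP(s_1, s_i) \quad \text{(since } d(1, i) = DP(s_1, s_i)\text{)}$$

$$v(M^*) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j) \ge \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} DP(s_i, s_j) \ge k \sum_{j=2}^{k} DP(s_1, s_j) \quad \text{(definition of } s_1\text{)}$$

$$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$$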

23

Progressive Alignments

Idea: successively align pairs of sequences using pairwise alignment algorithms

General structure: Choose two sequences and align them using a pairwise alignment algorithm. Choose another sequence and align it to the current alignment. Repeat the previous stage as long as there are sequences left

24

Progressive Alignments

Differences between algorithms: Choosing the next sequence. Whether progression involves aligning sequences vs. alignments only, or also alignments vs. alignments. Scoring methods

Progressive alignment algorithms: Clustal W, T-Coffee

25

CLUSTAL W

Publication: Thompson et al., 1994

The algorithm consists of three stages: (1) Distance matrix construction, by pairwise alignment of each pair of sequences. (2) Guide tree construction from the distance matrix. (3) Progressive alignment of the sequences according to the branches in the guide tree
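The first two stages can be illustrated with a toy sketch. This is not the Clustal W implementation: it assumes unit-cost edit distance as the pairwise distance and swaps in plain average-linkage clustering for the guide tree, whereas Clustal W builds its guide tree with neighbor joining. The third stage is not shown, and all names are illustrative.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost pairwise alignment distance (stand-in for the real distance)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j - 1] + (a[i - 1] != b[j - 1]),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Stage 1: pairwise distance for each pair of sequences."""
    k = len(seqs)
    D = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        D[i][j] = D[j][i] = edit_distance(seqs[i], seqs[j])
    return D


def guide_tree(D):
    """Stage 2 (simplified): average-linkage clustering into nested index tuples."""
    clusters = {i: (i, [i]) for i in range(len(D))}  # id -> (subtree, members)
    next_id = len(D)
    while len(clusters) > 1:
        def avg(x, y):
            mx, my = clusters[x][1], clusters[y][1]
            return sum(D[p][q] for p in mx for q in my) / (len(mx) * len(my))

        # Join the two clusters with the smallest average distance.
        x, y = min(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        (tx, mx), (ty, my) = clusters.pop(x), clusters.pop(y)
        clusters[next_id] = ((tx, ty), mx + my)
        next_id += 1
    return next(iter(clusters.values()))[0]


seqs = ["AAAA", "AAAT", "GGGG", "GGGC"]
print(guide_tree(distance_matrix(seqs)))
```

The nested tuples give the order in which sequences and sub-alignments would be merged in the progressive stage.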

More on ClustalW – next week…
