Computational Genomics Lecture #3a


Page 1: Computational Genomics Lecture #3a

Computational Genomics Lecture #3a

Much of this class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.

Background Readings: Chapters 2.5, 2.7 in the text book, Biological Sequence Analysis, Durbin et al., 2001. Chapters 3.5.1-3.5.3, 3.6.2 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997. Chapter 15 in Gusfield’s book. p. 81 in Kanehisa’s book.

Multiple sequence alignment

Page 2: Computational Genomics Lecture #3a

Ladies and Gentlemen, Boys and Girls: the holy grail

Multiple Sequence Alignment

Page 3: Computational Genomics Lecture #3a

Multiple Sequence Alignment

S1=AGGTC

S2=GTTCG

S3=TGAAC

Possible alignment

S1: A G G T - C -
S2: - G - T T C G
S3: T G - A A C -

Possible alignment

S1: A G G T - C -
S2: G T T - - C G
S3: - T G A A A C

Page 4: Computational Genomics Lecture #3a

Multiple Sequence Alignment

Definition: Given strings S1, S2, …, Sk, a multiple (global) alignment maps them to strings S’1, S’2, …, S’k that may contain blanks, where:

1. |S’1|= |S’2|=…= |S’k|

2. The removal of spaces from S’i leaves Si

Aligning more than two sequences.
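The two conditions of the definition can be checked mechanically. The following is a minimal sketch, not part of the original lecture: the function name `is_valid_alignment` and the choice of "-" as the blank symbol are assumptions made for illustration. It uses the three strings S1, S2, S3 from the earlier slide.

```python
def is_valid_alignment(originals, aligned, gap="-"):
    """Check the two conditions of the definition of a multiple alignment."""
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned strings have the same length.
    same_length = len({len(row) for row in aligned}) == 1
    # Condition 2: removing the blanks from S'i leaves Si.
    restores = all(row.replace(gap, "") == s for s, row in zip(originals, aligned))
    return same_length and restores


seqs = ["AGGTC", "GTTCG", "TGAAC"]
rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
print(is_valid_alignment(seqs, rows))  # True
```

A row that does not spell its original string after the blanks are removed, or rows of unequal length, make the check fail.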

Page 5: Computational Genomics Lecture #3a

Multiple alignments

We use a matrix to represent the alignment of k sequences, K=(x1,...,xk). We assume no column consists solely of blanks.

x1  M Q - I L L L
x2  M L R - L L -
x3  M K - I L L L
x4  M P P V L I L

The common scoring functions give a score to each column, and set: $score(K) = \sum_i score(column(i))$

For k=10, a scoring function has $2^k - 1 > 1000$ entries to specify. The scoring function is symmetric, so the order of its arguments does not matter: score(I,-,I,V) = score(-,I,I,V).

Page 6: Computational Genomics Lecture #3a

SUM OF PAIRS

M Q - I L L L
M L R - L L -
M K - I L L L
M P P V L I L

A common scoring function is SP, the sum of scores of the projected pairwise alignments: $SPscore(K) = \sum_{i<j} score(x_i, x_j)$.

In order for this score to be written as $\sum_i score(column(i))$, we set score(-,-) = 0. Why?

Because a column in which both xi and xj hold a blank is counted in the sum over columns, but it is dropped from the projected pairwise alignment of xi and xj (the two lines), so it must contribute nothing.

Note that we need to specify score(-,-) because a column may have several blanks (as long as not all entries are blanks).
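The equality between the column-wise sum and the sum over projected pairwise alignments can be checked on the matrix above. This is only a sketch: the concrete score values in `pair_score` (match +1, mismatch -1, letter against blank -2) are hypothetical, and the only value taken from the slide is score(-,-) = 0.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b):
    # Hypothetical values, except score(-,-) = 0, which is the slide's convention.
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def sp_by_columns(rows):
    """Sum over columns of the scores of all pairs of entries in the column."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


def projected_score(x, y):
    """Score of the projected pairwise alignment: blank/blank columns are dropped."""
    return sum(pair_score(a, b) for a, b in zip(x, y) if not (a == GAP and b == GAP))


def sp_by_pairs(rows):
    """Sum over all pairs of rows of the projected pairwise alignment scores."""
    return sum(projected_score(x, y) for x, y in combinations(rows, 2))


rows = ["MQ-ILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_by_columns(rows) == sp_by_pairs(rows))  # True
```

If `pair_score` returned a non-zero value for two blanks, the column-wise sum would count that value while the projected alignments would not, and the two quantities could differ.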

Page 7: Computational Genomics Lecture #3a

SUM OF PAIRS

M Q - I L L L
M L R - L L -
M K - I L L L
M P P V L I L

Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all $\binom{k}{2}$ projected pairwise alignments induced by A, where the pairwise alignment function score(xi,xj) is additive.

Page 8: Computational Genomics Lecture #3a

Example

Consider the following alignment:

a c - c d b -

- c - a d b d

a - b c d a d

Using the edit distance, with $\delta(x,x) = 0$ and $\delta(x,y) = 1$ for $x \neq y$,

this alignment has an SP value of

3 + 4 + 5 = 12
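The arithmetic of this example can be reproduced in a few lines. The sketch below assumes the edit distance charges 0 for two equal symbols (including two blanks) and 1 otherwise; the helper names `delta` and `induced_distance` are illustrative only.

```python
from itertools import combinations


def delta(x, y):
    # Assumed edit-distance cost: 0 for equal symbols (two blanks included), 1 otherwise.
    return 0 if x == y else 1


def induced_distance(row_a, row_b):
    """Value of the pairwise alignment that the multiple alignment induces on two rows."""
    return sum(delta(x, y) for x, y in zip(row_a, row_b))


rows = ["ac-cdb-", "-c-adbd", "a-bcdad"]
pair_values = [induced_distance(a, b) for a, b in combinations(rows, 2)]
print(pair_values, sum(pair_values))  # [3, 4, 5] 12
```

The three values correspond to the row pairs (1,2), (1,3) and (2,3), in that order.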

Page 9: Computational Genomics Lecture #3a

Multiple Sequence Alignment

Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes

$SP\text{-}score(K) = \sum_{i<j} score(x_i, x_j)$.

Instead of a 2-dimensional table, we now have a k-dimensional table to fill.

For each vector i =(i1,..,ik), compute an optimal multiple alignment for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik).

The adjacent entries are those whose index differs from i by one or zero in every coordinate. Each entry depends on $2^k - 1$ adjacent entries.
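The k-dimensional table can be sketched with a memoized recursion over index vectors. This is a toy illustration under stated assumptions, not the lecture's implementation: the score values in `pair_score` are hypothetical (only score(-,-) = 0 is taken from the slides), and the running time is exponential in k, so it is usable only for a few short strings.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"


def pair_score(a, b):
    # Hypothetical scores: match +1, mismatch -1, letter vs. blank -1, blank vs. blank 0.
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def column_score(column):
    """SP score of one column: sum over all pairs of its entries."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def optimal_sp_score(seqs):
    """Maximum SP score over all multiple alignments of seqs."""
    k = len(seqs)
    # The 2^k - 1 non-zero binary vectors: which rows contribute a letter to the last column.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(index):
        if not any(index):
            return 0  # all prefixes are empty
        candidates = []
        for move in moves:
            prev = tuple(i - b for i, b in zip(index, move))
            if min(prev) < 0:
                continue  # this move would step outside the table
            column = tuple(
                seqs[r][index[r] - 1] if move[r] else GAP for r in range(k)
            )
            candidates.append(best(prev) + column_score(column))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))


print(optimal_sp_score(["ACG", "ACG", "ACG"]))  # 9
```

For three identical strings of length 3 every projected pair can score at most 3, so 9 is the largest possible value, and the gap-free alignment attains it.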

Page 10: Computational Genomics Lecture #3a

The idea via k=2

Recall the notation:

$V[i,j] = d(s[1..i],\ t[1..j])$

and the following recurrence for V (σ denotes the score of a single column):

$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$

V[i,j]      V[i+1,j]
V[i,j+1]    V[i+1,j+1]

Note that the new cell index (i+1,j+1) differs from previous indices by one of $2^k - 1$ non-zero binary vectors (1,1), (1,0), (0,1).
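For k=2 the recurrence fills an ordinary two-dimensional table. The sketch below is one way to write it; the parameter name `sigma` for the column score and the example values (match +1, everything else -1) are assumptions made for illustration. Python strings are 0-based, so `s[i]` plays the role of the (i+1)-th letter of s.

```python
def global_alignment_score(s, t, sigma, gap="-"):
    """Fill V, where V[i][j] is the best score for the prefixes s[:i] and t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):  # first column: letters of s against blanks
        V[i + 1][0] = V[i][0] + sigma(s[i], gap)
    for j in range(m):  # first row: blanks against letters of t
        V[0][j + 1] = V[0][j] + sigma(gap, t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),     # step (1,1)
                V[i][j + 1] + sigma(s[i], gap),  # step (1,0)
                V[i + 1][j] + sigma(gap, t[j]),  # step (0,1)
            )
    return V[n][m]


def example_sigma(a, b):
    # Hypothetical column score: +1 for a match, -1 for a mismatch or a blank.
    return 1 if a == b else -1


print(global_alignment_score("AGGTC", "AGGTC", example_sigma))  # 5
```

The three arguments of `max` correspond to the three non-zero binary vectors (1,1), (1,0) and (0,1).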

Page 11: Computational Genomics Lecture #3a

Multiple Sequence Alignment

Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment.

Computational Cost:

• Instead of a 2-dimensional table we now have a k-dimensional table to fill.

• Each dimension’s size is n+1. Each entry depends on $2^k - 1$ adjacent entries.

Number of evaluations of scoring function: $O(2^k n^k)$

Page 12: Computational Genomics Lecture #3a

Complexity of the DP approach

Number of cells: $n^k$.

Number of adjacent cells: $O(2^k)$.

Computation of the SP score for each column(i,b) is $O(k^2)$.

Total run time is $O(k^2 2^k n^k)$, which is totally unacceptable!
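To get a feel for how quickly this bound grows, the three factors can simply be multiplied out. This is a rough operation count under the slide's own simplifications (n^k cells rather than (n+1)^k); the function name `dp_cost` is illustrative.

```python
def dp_cost(k, n):
    """Rough count of pair-score evaluations for the k-dimensional DP."""
    cells = n ** k                        # entries of the table
    neighbours = 2 ** k - 1               # adjacent entries examined per cell
    pairs_per_column = k * (k - 1) // 2   # pair scores summed for one column
    return cells * neighbours * pairs_per_column


for k in (2, 3, 5, 10):
    print(k, dp_cost(k, 100))
```

Already for k=3 and n=100 the count is in the tens of millions, and every additional sequence multiplies it by roughly 2n.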

Maybe one can do better?

Page 13: Computational Genomics Lecture #3a

But MSA is Intractable

Not much hope for a polynomial algorithm because the problem has been shown to be NP-complete. The proof is quite tricky and quite recent. Some previous proofs were bogus.

Isaac Elias provided an apparently correct proof.

We need heuristics or approximation to reduce the running time.

Page 14: Computational Genomics Lecture #3a

Tree Alignments

Assume that there is a tree T=(V,E) whose leaves are the input sequences.

• Want to associate a sequence with each internal node.
• $Tree\text{-}score(K) = \sum_{(i,j) \in E} score(x_i, x_j)$.

Finding the optimal assignment of sequences to the internal nodes is NP-hard.

We will meet this problem again in the study of phylogenetic trees (it is related to the parsimony problem).

Page 15: Computational Genomics Lecture #3a

Multiple Sequence Alignment Heuristics

Example - 4 sequences A, B, C, D.

A. Perform all 6 pairwise alignments. Find scores. Build a “similarity tree”.

[Figure: similarity tree with leaves B, D, A, C, on an axis running from “similar” to “distant”]

B. Multiple alignment following the tree from A.

• Align the most similar pair (B, D), allowing gaps to optimize alignment.
• Align the next most similar pair (A, C).
• Now, “align the alignments”, introducing gaps if necessary to optimize alignment of (BD) with (AC).

Page 16: Computational Genomics Lecture #3a

The tree-based progressive method for multiple sequence alignment, used in practice (Clustal)

(a) A tree (dendrogram) obtained by “cluster analysis”, with leaves DEHUG3, DEPGG3, DEBYG3, DEZYG3, DEBSGF.

(b) Pairwise alignment of sequences’ alignments:

L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G
L R R - A R T A S A
L - R G A R A A A E

(modified from Speed’s ppt presentation; see p. 81 in Kanehisa’s book)

Page 17: Computational Genomics Lecture #3a

Visualization of Alignment

Page 18: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm

Now we will see an $O(k^2 n^2)$ multiple alignment algorithm for the SP-score that approximates the optimal solution’s score within a factor of at most $2(1 - 1/k) < 2$.

Page 19: Computational Genomics Lecture #3a

Star Alignments

Rather than summing up all pairwise alignments, select a fixed sequence S1 as a center, and set

$Star\text{-}score(K) = \sum_{j>1} score(S_1, S_j)$.

The algorithm to find an optimal alignment: at each step, add another sequence aligned with S1, keeping old gaps and possibly adding new ones (i.e. keeping the old alignment intact).

Page 20: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm

Polynomial time algorithm:

assumption: the function δ is a distance function:

• $\delta(x,x) = 0$
• $\delta(x,y) = \delta(y,x) \geq 0$
• $\delta(x,z) \leq \delta(x,y) + \delta(y,z)$ (triangle inequality)

Let D(S,T) be the value of the minimum global alignment between S and T.

Page 21: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Polynomial time algorithm:

The input is a set Γ of k strings Si.

1. Find “center string” S1 that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$

2. Call the remaining strings S2, …,Sk.

3. Add a string to the multiple alignment that initially contains only S1 as follows:

• Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1. Add Si by running the dynamic programming algorithm on S’1 and Si to produce S’’1 and S’i.

• Adjust S’2, …,S’i-1 by adding gaps to those columns where gaps were added to get S’’1 from S’1.

• Replace S’1 by S’’1.
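The three steps above can be sketched as follows. This is an illustrative sketch only: it assumes the edit distance (cost 0 for equal symbols, including two blanks, and 1 otherwise) as the distance function δ, and the names `delta`, `align` and `center_star` are not from the lecture. Instead of returning the two aligned strings, `align` returns the list of aligned index pairs, which makes it easy to copy new blanks into the rows that are already aligned.

```python
GAP = "-"


def delta(x, y):
    # Assumed distance: 0 for equal symbols (two blanks included), 1 otherwise.
    return 0 if x == y else 1


def align(s, t):
    """Return (D(s, t), path); path lists column pairs (i, j), None marks a new blank."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                D[i - 1][j] + delta(s[i - 1], GAP),
                D[i][j - 1] + delta(GAP, t[j - 1]),
            )
    path, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], GAP):
            path.append((i - 1, None))
            i -= 1
        else:
            path.append((None, j - 1))
            j -= 1
    return D[n][m], path[::-1]


def center_star(strings):
    """Return (index of the center string, aligned rows in input order)."""
    k = len(strings)
    # Step 1: the center minimizes the sum of distances to all other strings.
    totals = [sum(align(s, t)[0] for t in strings) for s in strings]
    center = min(range(k), key=totals.__getitem__)
    order = [center] + [i for i in range(k) if i != center]
    rows = [strings[center]]  # rows[0] plays the role of S'1
    for idx in order[1:]:
        # Step 3: align the current S'1 with the next string.
        _, path = align(rows[0], strings[idx])
        # Every existing row gets a blank wherever the path inserts one into S'1.
        rows = ["".join(GAP if i is None else row[i] for i, _ in path) for row in rows]
        rows.append("".join(GAP if j is None else strings[idx][j] for _, j in path))
    aligned = [None] * k
    for row, idx in zip(rows, order):
        aligned[idx] = row
    return center, aligned


strings = ["AGGTC", "GTTCG", "TGAAC"]
center, rows = center_star(strings)
print(center)
print("\n".join(rows))
```

Blanks that are already present in S'1 are treated as ordinary symbols by `align`, so a blank placed opposite them costs nothing. This keeps the distance that the final alignment induces between the center and each Si equal to D(S1,Si).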

Page 22: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Time analysis:

• Choosing S1 – running the dynamic programming algorithm $\binom{k}{2}$ times – $O(k^2 n^2)$

• When Si is added to the multiple alignment, the length of S’1 is at most i·n, so the time to add all k strings is

$$\sum_{i=1}^{k-1} O\big((i \cdot n) \cdot n\big) = O(k^2 n^2)$$

Page 23: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm (cont.)

Performance analysis:

• M - the alignment produced by this algorithm.

• d(i,j) - the distance M induces on the pair Si,Sj.

• M* - optimal alignment.

$$v(M) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(i,j)$$

For all i, d(1,i)=D(S1,Si)

(we performed optimal alignment between S’1 and Si and $\delta(-,-) = 0$)

Page 24: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm (cont.)

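Performance analysis:

Upper bound on v(M), using the triangle inequality:

$$
\begin{aligned}
v(M) &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(i,j) \\
&\leq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} \big[ d(i,1) + d(1,j) \big] && \text{(triangle inequality)} \\
&= 2(k-1) \sum_{l=2}^{k} d(1,l) \\
&= 2(k-1) \sum_{l=2}^{k} D(S_1, S_l)
\end{aligned}
$$

Lower bound on v(M*), where d*(i,j) is the distance M* induces on the pair Si,Sj:

$$
\begin{aligned}
v(M^*) &= \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^*(i,j) \\
&\geq \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} D(S_i, S_j) \\
&\geq \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) && \text{(definition of } S_1\text{)} \\
&= k \sum_{j=2}^{k} D(S_1, S_j)
\end{aligned}
$$

Dividing the two bounds:

$$\frac{v(M)}{v(M^*)} \leq \frac{2(k-1)}{k} < 2$$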

Page 25: Computational Genomics Lecture #3a

Multiple Sequence Alignment – Approximation Algorithm

The algorithm relies heavily on the scoring function being a distance. It produces an alignment whose SP score is at most twice the minimum.

What if the scoring function was a similarity?

Can we get an efficient algorithm whose score is half the maximum? A third of the maximum? …

We dunno!