Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

48
Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha

Transcript of Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Page 1: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Dynamic Programming: Sequence alignment

CS 466

Saurabh Sinha

Page 2: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

DNA Sequence Comparison: First Success Story

• Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s function

• In 1984 Russell Doolittle and colleagues found similarities between cancer-causing gene and normal growth factor (PDGF) gene

• A normal growth gene switched on at the wrong time causes cancer !

Page 3: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Cystic Fibrosis • Cystic fibrosis (CF) is a chronic and frequently fatal genetic

disease of the body's mucus glands. CF primarily affects the respiratory systems in children.

• Search for the CF gene was narrowed to ~1 Mbp, and the region was sequenced.

• Scanned a database for matches to known genes. A segment in this region matched the gene for some ATP binding protein(s). These proteins are part of the ion transport channel, and CF involves sweat secretions with abnormal sodium content!

Page 4: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Role for Bioinformatics• Gene similarities between two genes with known

and unknown function alert biologists to some possibilities

• Computing a similarity score between two genes tells how likely it is that they have similar functions

• Dynamic programming is a technique for revealing similarities between genes

Page 5: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Motivating Dynamic Programming

Page 6: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Dynamic programming example:Manhattan Tourist Problem

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Sink

*

*

*

*

*

**

* *

*

*

Source

*

Page 7: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Sink

*

*

*

*

*

**

* *

*

*

Source

*

Dynamic programming example:Manhattan Tourist Problem

Page 8: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Manhattan Tourist Problem: Formulation

Goal: Find the longest path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”

Output: A longest path in G from “source” to “sink”

Page 9: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: An Example

3 2 4

0 7 3

3 3 0

1 3 2

4

4

5

6

4

6

5

5

8

2

2

5

0 1 2 3

0

1

2

3

j coordinatei c

oor

dina

te

13

source

sink

4

3 2 4 0

1 0 2 4 3

3

1

1

2

2

2

419

95

15

23

0

20

3

4

Page 10: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Greedy Algorithm Is Not Optimal

1 2 5

2 1 5

2 3 4

0 0 0

5

3

0

3

5

0

10

3

5

5

1

2promising start, but leads to bad choices!

source

sink18

22

Page 11: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Simple Recursive Program

MT(n,m) if n=0 or m=0 return MT(n,m) x MT(n-1,m)+ length of the edge from (n- 1,m) to

(n,m) y MT(n,m-1)+ length of the edge from (n,m-1) to

(n,m) return max{x,y}What’s wrong with this approach?

Page 12: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Here’s what’s wrong

• M(n,m) needs M(n, m-1) and M(n-1, m)

• Both of these need M(n-1, m-1)

• So M(n-1, m-1) will be computed at least twice

• Dynamic programming: the same idea as this recursive algorithm, but keep all intermediate results in a table and reuse

Page 13: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

1

5

0 1

0

1

i

source

1

5S1,0 = 5

S0,1 = 1

• Calculate optimal path score for each vertex in the graph

• Each vertex’s score is the maximum of the prior vertices score plus the weight of the respective edge in between

MTP: Dynamic Programmingj

Page 14: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Dynamic Programming (cont’d)

1 2

5

3

0 1 2

0

1

2

source

1 3

5

8

4

S2,0 = 8

i

S1,1 = 4

S0,2 = 33

-5

j

Page 15: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Dynamic Programming (cont’d)

1 2

5

3

0 1 2 3

0

1

2

3

i

source

1 3

5

8

8

4

0

5

8

103

5

-5

9

131-5

S3,0 = 8

S2,1 = 9

S1,2 = 13

S3,0 = 8

j

Page 16: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Dynamic Programming (cont’d)

greedy alg. fails!

1 2 5

-5 1 -5

-5 3

0

5

3

0

3

5

0

10

-3

-5

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

9

12

S3,1 = 9

S2,2 = 12

S1,3 = 8

j

Page 17: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Dynamic Programming (cont’d)

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

S3,2 = 9

S2,3 = 15

Page 18: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Dynamic Programming (cont’d)

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

0

1

16S3,3 = 16

(showing all back-traces)

Done!

Page 19: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

MTP: Recurrence

Computing the score for a point (i,j) by the recurrence relation:

si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)

The running time is n x m for a n by m grid

(n = # of rows, m = # of columns)

Page 20: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Manhattan Is Not A Perfect Grid

What about diagonals?

• The score at point B is given by:

sB =max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

B

A3

A1

A2

Page 21: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Manhattan Is Not A Perfect Grid (cont’d)

Computing the score for point x is given by the recurrence relation:

sx = max

of

sy + weight of vertex (y, x) where

y є Predecessors(x)

• Predecessors (x) – set of vertices that have edges leading to x

•The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once

Page 22: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Traveling in the Grid

•By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed – otherwise we are in trouble.

•We need to traverse the vertices in some order

•For a grid, can traverse vertices row by row, column by column, or diagonal by diagonal

Page 23: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Traversing the Manhattan Grid • 3 different strategies:

– a) Column by column– b) Row by row– c) Along diagonals

a) b)

c)

Page 24: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Traversing a DAG

• A numbering of vertices of the graph is called topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label

• How to obtain a topological ordering ?

Page 25: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Alignment

Page 26: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Aligning DNA Sequences

V = ATCTGATGW = TGCATAC

n = 8

m = 7

A T C T G A T G

T G C A T A CV W

match

deletioninsertion

mismatch

indels

4122

matches mismatches insertions deletions

Alignment : 2 x k matrix ( k m, n )

Page 27: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Longest Common Subsequence (LCS) – Alignment without Mismatches

• Given two sequences

v = v1 v2…vm and w = w1 w2…wn

• The LCS of v and w is a sequence of positions in

v: 1 < i1 < i2 < … < it < m

and a sequence of positions in

w: 1 < j1 < j2 < … < jt < n

such that it -th letter of v equals to jt-letter of w and t is maximal

Page 28: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

LCS: Example

A T -- C T G A T C

-- T G C T -- A -- C

elements of v

elements of w

--

A1

2

0

1

2

2

3

3

4

3

5

4

5

5

6

6

6

7

7

8

j coords:

i coords:

Matches shown in redpositions in v:

positions in w:

2 < 3 < 4 < 6 < 8

1 < 3 < 5 < 6 < 7

Every common subsequence is a path in 2-D grid

0

0

(0,0)(1,0)(2,1)(2,2)(3,3)(3,4)(4,5)(5,5)(6,6)(7,6)(8,7)

Page 29: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Computing LCSLet vi = prefix of v of length i: v1 … vi

and wj = prefix of w of length j: w1 … wj

The length of LCS(vi,wj) is computed by:

si, j = maxsi-1, j

si, j-1

si-1, j-1 + 1 if vi = wj

Page 30: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

LCS Problem as Manhattan Tourist Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Page 31: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Edit Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Page 32: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Edit Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Every path is a common subsequence.

Every diagonal edge adds an extra element to common subsequence

LCS Problem: Find a path with maximum number of diagonal edges

Page 33: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Backtracking

• si,j allows us to compute the length of LCS for v i

and wj

• sm,n gives us the length of the LCS for v and w• How do we print the actual LCS ?• At each step, we chose an optimal decision si,j =

max (…)• Record which of the choices was made in order to

obtain this max

Page 34: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Computing LCSLet vi = prefix of v of length i: v1 … vi

and wj = prefix of w of length j: w1 … wj

The length of LCS(vi,wj) is computed by:

si, j = maxsi-1, j

si, j-1

si-1, j-1 + 1 if vi = wj

Page 35: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Printing LCS: Backtracking

1. PrintLCS(b,v,i,j)2. if i = 0 or j = 03. return4. if bi,j = “ “5. PrintLCS(b,v,i-1,j-1)6. print vi

7. else

8. if bi,j = “ “9. PrintLCS(b,v,i-1,j)10. else11. PrintLCS(b,v,i,j-1)

Page 36: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

From LCS to Alignment

• The Longest Common Subsequence problem—the simplest form of sequence alignment – allows only insertions and deletions (no mismatches).

• In the LCS Problem, we scored 1 for matches and 0 for indels

• Consider penalizing indels and mismatches with negative scores

• Simplest scoring scheme: +1 : match premium -μ : mismatch penalty -σ : indel penalty

Page 37: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Simple Scoring

• When mismatches are penalized by –μ, indels are penalized by –σ,

and matches are rewarded with +1,

the resulting score is:

#matches – μ(#mismatches) – σ (#indels)

Page 38: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

The Global Alignment Problem

Find the best alignment between two strings under a given scoring schema

Input : Strings v and w and a scoring schemaOutput : Alignment of maximum score

→ = -σ = 1 if match = -µ if mismatch

si-1,j-1 +1 if vi = wj

si,j = max s i-1,j-1 -µ if vi ≠ wj

s i-1,j - σ s i,j-1 - σ

: mismatch penaltyσ : indel penalty

Page 39: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Scoring Matrices

To generalize scoring, consider a (4+1) x(4+1) scoring matrix δ.

In the case of an amino acid sequence alignment, the scoring matrix would be a (20+1)x(20+1) size. The addition of 1 is to include the score for comparison of a gap character “-”.

This will simplify the algorithm as follows:

si-1,j-1 + δ (vi, wj)

si,j = max s i-1,j + δ (vi, -)

s i,j-1 + δ (-, wj)

Page 40: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Making a Scoring Matrix

• Scoring matrices are created based on biological evidence.

• Alignments can be thought of as two sequences that differ due to mutations.

• Some of these mutations have little effect on the protein’s function, therefore some penalties, δ(vi , wj), will be less harsh than others.

Page 41: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Scoring Matrix: ExampleA R N K

A 5 -2 -1 -1

R - 7 -1 3

N - - 7 0

K - - - 6

• Notice that although R and K are different amino acids, they have a positive score.

• Why? They are both positively charged amino acids will not greatly change function of protein.

Page 42: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Conservation

• Amino acid changes that tend to preserve the physico-chemical properties of the original residue– Polar to polar

• aspartate glutamate– Nonpolar to nonpolar

• alanine valine– Similarly behaving residues

• leucine to isoleucine

Page 43: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Local vs. Global Alignment

• The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.

• The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.

• In the edit graph with negatively-scored edges, Local Alignment may score higher than Global Alignment

Page 44: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Local Alignment: Example

Global alignment

Local alignment

Compute a “mini” Global Alignment to get Local Alignment

Page 45: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Local Alignments: Why?

• Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions.

• Example:– Homeobox genes have a short region

called the homeodomain that is highly conserved between species.

– A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

Page 46: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

The Local Alignment Problem

• Goal: Find the best local alignment between two strings

• Input : Strings v, w and scoring matrix δ

• Output : Alignment of substrings of v and w whose alignment score is maximum among all possible alignment of all possible substrings

Page 47: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

The Problem Is …

• Long run time O(n4):

- In the grid of size n x n there are ~n2 vertices (i,j) that may serve as a source.

- For each such vertex computing alignments from (i,j) to (i’,j’) takes O(n2) time.

• This can be remedied by allowing every point to be the starting point

Page 48: Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

The Local Alignment Recurrence• The largest value of si,j over the whole edit

graph is the score of the best local alignment.

• The recurrence:

0 si,j = max si-1,j-1 + δ (vi, wj)

s i-1,j + δ (vi, -)

s i,j-1 + δ (-, wj)

Notice there is only this change from the original recurrence of a Global Alignment