Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B:...

28
Alignment II Dynamic Programming

Transcript of Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B:...

Page 1: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

Alignment IIDynamic Programming

Page 2: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

2

Pair-wise sequence alignments

A: C A T - T C A - C

| | | | |

B: C - T C G C A G C

Idea: Display one sequence above another with spaces inserted in both to reveal similarity

Page 3: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

3

Two types of alignment

S = CTGTCGCTGCACGT = TGCCGTG

CTGTCGCTGCACG---------TGC-CGTG

CTGTCG-CTGCACG

-TGC-CG-TG----

Global alignment Local alignment

Page 4: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

4

Global alignment: ScoringCTGTCG-CTGCACG

-TGC-CG-TG----

Reward for matches: Mismatch penalty: Space penalty:

score(A) = w – x - y

w = #matches x = #mismatchesy = #spaces

Page 5: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

5

Global alignment: Scoring

C T G T C G – C T G C - T G C – C G – T G -

-5 10 10 -2 -5 -2 -5 -5 10 10 -5

Total = 11

Reward for matches:10

Mismatch penalty: 2Space penalty: 5

Page 6: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

6

Optimum Alignment

• The score of an alignment is a measure of its quality

• Optimum alignment problem: Given a pair of sequences X and Y, find an alignment (global or local) with maximum score

• The similarity between X and Y, denoted sim(X,Y), is the maximum score of an alignment of X and Y

Page 7: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

7

Alignment algorithms

• Global: Needleman-Wunsch• Local: Smith-Waterman• NW and SW use dynamic

programming• Variations:

– Gap penalty functions– Scoring matrices

Page 8: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

8

Global Alignment: Algorithm

1..j1..i T and S of alignment optimum of Cost),( jiC

T of jlength of Prefix

S of i length of Prefix

..1

..1

j

i

T

S

ba

babaw

if

if),(

Page 9: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

9

)1j,i(C

)j,1i(C

)T,S(w)1j,1i(C

max)j,i(Cji

j)j,0(Ci)0,i(C

Initial conditions:

Recurrence relation: For 1 i n, 1 j m:

Theorem. C(i,j) satisfies the following relationships:

Page 10: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

10

Justification

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj-1 Tj

C(i-1,j-1) + w(Si,Tj)

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj —

C(i-1,j)

S1 S2 . . . Si —

T1 T2 . . . Tj-1 Tj

C(i,j-1)

Page 11: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

11

Example

Case 1: Line up Si with Tj

S: C A T T C A C T: C - T T C A G

i - 1 i

jj -1

S: C A T T C A - C T: C - T T C A G -

Case 2: Line up Si with spacei - 1 i

j

S: C A T T C A C - T: C - T T C A - G

Case 3: Line up Tj with spacei

jj -1

Page 12: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

12

Computation Procedure

C(n,m)

C(0,0)

C(i,j)

)1j,i(C,)j,1i(C),T,S(w)1j,1i(Cmax)j,i(C ji

C(i-1,j)C(i-1,j-1)

C(i,j-1)

Page 13: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

13

λ C T C G C A G C

A

C

T

T

C

A

C

+10 for match, -2 for mismatch, -5 for space

0 -5 -10 -15 -20 -25 -30 -35 -40

-5

-10

-15

-20

-25

-30

-35

10 5

λ

Page 14: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

14

0 -5 -10 -15 -20 -25 -30 -35 -40

-5 10 5 0 -5 -10 -15 -20 -25

-10 5 8 3 -2 -7 0 -5 -10

-15 0 15 10 5 0 -5 -2 -7

-20 -5 10 13 8 3 -2 -7 -4

-25 -10 5 20 15 18 13 8 3

-30 -15 0 15 18 13 28 23 18

-35 -20 -5 10 13 28 23 26 33

λ C T C G C A G C

A

C

T

T

C

A

C

λ

Traceback can yield both optimum alignments

**

Page 15: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

15

End-gap free alignment

• Gaps at the start or end of alignment are not penalized

Best global Best end-gap free

Match: +2 Mismatch and space: -1

Score = 1 Score = 9

Page 16: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

16

Motivation: Shotgun assembly

• Shotgun assembly produces large set of partially overlapping subsequences from many copies of one unknown DNA sequence.

• Problem: Use the overlapping sections to ”paste” the subsequences together.

• Overlapping pairs will have low global alignment score, but high end-space free score because of overlap.

Page 17: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

17

Motivation: Shotgun assembly

Page 18: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

18

Algorithm

• Same as global alignment, except:– Initialize with zeros (free gaps at

start)– Locate max in the last row/column

(free gaps at end)

Page 19: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

19

10 5 10 5 10 5 0 10

λ C T C G C A G C

A

C

T

T

C

A

G

+10 for match, -2 for mismatch, -5 for gap

0 0 0 0 0 0 0 0 0

0

0

0

0

0

0

0

λ

5 8 5 8 5 20 15 10

0 15 10 5 6 15 18 13

-2 10 13 8 3 10 13 16

10 5 20 15 18 13 8 23 5 8 15 18 13 28 23 18

0 3 10 25 20 23 38 33

Page 20: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

20

Local Alignment: Motivation

• Ignoring stretches of non-coding DNA:– Non-coding regions are more likely to be

subjected to mutations than coding regions.– Local alignment between two sequences is

likely to be between two exons.

• Locating protein domains:– Proteins of different kind and of different

species often exhibit local similarities– Local similarities may indicate ”functional

subunits”.

Page 21: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

21

Local alignment: Example

Best local alignment:

Match: +2 Mismatch and space: -1

Score = 5

S = g g t c t g a gT = a a a c g a

g g t c t g a ga a a c – g a -

Page 22: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

22

Local Alignment: Algorithm

Initialize top row and leftmost column to zero.

0

1,

,1

,]1,1[

max ,

jiC

jiC

jtisscorejiC

jiC

C [i, j] = Score of optimally aligning a suffix of s with a suffix of t.

Page 23: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

23

0 0 0 0 0 0 0 0 0

0 1 0 1 0 1 0 0 1

0 0 0 0 0 0 2 0 0

0 0 1 0 0 0 0 1 0

0 0 1 0 0 0 0 0 0

0 1 0 2 0 1 0 0 1

0 0 0 0 1 0 2 0 0

0 1 0 1 0 2 0 1 1

λ C T C G C A G C

A

C

T

T

C

A

C

λ

+1 for a match, -1 for a mismatch, -5 for a space

Page 24: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

24

Some Results

• Most pairwise sequence alignment problems can be solved in O(mn) time.

• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88].

• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86].

Page 25: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

25

Reducing space requirements

• O(mn) tables are often the limiting factor in computing large alignments

• There is a linear space technique that only doubles the time required [Hirschberg77]

Page 26: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

26

0 10 5 10 5 10 5 0 10

λ C T C G C A G C

A

C

T

T

C

A

G

IDEA: We only need the previous row to calculate the next

0 0 0 0 0 0 0 0 0λ

0 5 8 5 8 5 20 15 10

Page 27: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

27

Linear-space Alignments

mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn

Page 28: Alignment II Dynamic Programming. 2 Pair-wise sequence alignments A: C A T - T C A - C | | | | | B: C - T C G C A G C Idea: Display one sequence above.

28

Affine Gap Penalty Functions

Gap penalty = h + gk

where

k = length of gaph = gap opening penaltyg = gap continuation penalty

Can also be solved in O(nm) time using dynamic programming