Dynamic Programming (Edit Distance). Edit Distance Input: Two input strings S1 (of size n) and S2...


Page 1: Dynamic Programming (Edit Distance). Edit Distance Input:  Two input strings S1 (of size n) and S2 (of size m) E.g., S1 = ATTTCTAGTGGGTAAA S2 = ATCTAGTTTAGGGATA.

Dynamic Programming (Edit Distance)

Page 2: Edit Distance

• Input:
  – Two input strings S1 (of size n) and S2 (of size m)
  – E.g., S1 = ATTTCTAGTGGGTAAA, S2 = ATCTAGTTTAGGGATA
• Target:
  – Find the smallest distance between S1 and S2
  – In other words, the smallest number of edit operations to convert S1 into S2
• Edit operations:
  – Insert (i), delete (d), align (a)

Page 3: Example

• S1: TCGACGTCA
• S2: TGACGTGC
• Three operations to convert S1 to S2:
  S1: TCGACGTGCA  (with the inserted G)
  S2: T GACGTGC   (blanks mark the deleted characters)
  – Delete C (position 2) and A (position 10)
  – Insert G (position 8)

Page 4: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA (the two strings of the Page 3 example with their roles swapped). Row 0 is the case where S1 is empty; column 0 is the case where S2 is empty.]

** Edit operations are applied to S1 to convert it into S2. Cost of insert: i. Cost of delete: d. Cost of align: a.

Row 0 so far: 0, i
• i: cost of inserting T into S1 to match S2

Page 5: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0: S1 is empty. Column 0: S2 is empty.]

Row 0 so far: 0, i, 2i
• 2i: cost of inserting TC into S1 to match S2

Page 6: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0: S1 is empty. Column 0: S2 is empty.]

Row 0 complete: 0, i, 2i, 3i, 4i, 5i, 6i, 7i, 8i, 9i

Page 7: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0 (S1 is empty): 0, i, 2i, ..., 9i.]

Column 0 so far: 0, 1d
• 1d: cost of deleting T from S1 to match S2

Page 8: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0 (S1 is empty): 0, i, 2i, ..., 9i.]

Column 0 so far: 0, 1d, 2d
• 2d: cost of deleting TG from S1 to match S2

Page 9: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0 (S1 is empty): 0, i, 2i, ..., 9i.]

Column 0 complete (S2 is empty): 0, 1d, 2d, 3d, 4d, 5d, 6d, 7d, 8d

Page 10: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0 (S1 is empty): 0, i, 2i, ..., 9i. Column 0 (S2 is empty): 0, 1d, 2d, ..., 8d.]

What we did so far is called the initialization phase:

M[0][j] = j * cost of insert (for all j)
M[k][0] = k * cost of delete (for all k)
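The initialization phase can be sketched in C as follows. This is a minimal illustration rather than code from the slides: the function name `init_borders`, the bounds `MAX_N` and `MAX_M`, and the parameters `ins_cost` and `del_cost` are assumptions made for the example.

```c
#include <assert.h>

#define MAX_N 64   /* assumed upper bound on the length of S1 */
#define MAX_M 64   /* assumed upper bound on the length of S2 */

/* Fill row 0 and column 0 of the DP table M.
 * Row 0: S1 is empty, so reaching S2[1..j] takes j insertions.
 * Column 0: S2 is empty, so reaching it from S1[1..k] takes k deletions. */
void init_borders(int M[MAX_N + 1][MAX_M + 1], int n, int m,
                  int ins_cost, int del_cost)
{
    assert(n >= 0 && n <= MAX_N && m >= 0 && m <= MAX_M);
    for (int j = 0; j <= m; j++)
        M[0][j] = j * ins_cost;
    for (int k = 0; k <= n; k++)
        M[k][0] = k * del_cost;
}
```

For the slide example (n = 8, m = 9) with both costs set to 1, row 0 becomes 0 to 9 and column 0 becomes 0 to 8.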

Page 11: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0 (S1 is empty): 0, i, 2i, ..., 9i. Column 0 (S2 is empty): 0, 1d, 2d, ..., 8d.]

For simplicity, let's assume the following costs:
• Cost of insert (i) = 1
• Cost of delete (d) = 1
• Cost of align (a) = 0 if the aligned characters are the same, 1 if the aligned characters are different

Page 12: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA.]

With the assumed costs (insert = 1, delete = 1, align = 0 for identical characters and 1 for different ones), the initialized borders become:
• Row 0 (S1 is empty): 0 1 2 3 4 5 6 7 8 9
• Column 0 (S2 is empty): 0 1 2 3 4 5 6 7 8

Page 13: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Row 0: 0 1 2 ... 9. Column 0: 0 1 2 ... 8. Cells (i, j) and (n, m) are highlighted.]

• Cell (i, j): smallest cost for converting S1[1..i] to match S2[1..j]
• Cell (n, m): our goal is to convert S1[1..n] to match S2[1..m]

Page 14: Edit Distance

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Cell (i, j) is highlighted.]

M[i, j] = Min of:
  M[i-1, j-1] + cost of align S1[i] and S2[j]
  M[i-1, j] + cost of delete S1[i]
  M[i, j-1] + cost of insert S2[j] into S1
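The recurrence for a single cell can be written as a small C helper. This is only a sketch under the unit costs assumed earlier in the slides (insert = 1, delete = 1, align = 0 or 1); the names `align_cost`, `min2` and `cell_value` are hypothetical and chosen for illustration.

```c
#include <assert.h>

/* Cost of aligning two characters under the unit-cost model:
 * 0 if they are the same, 1 if they differ. */
int align_cost(char a, char b)
{
    return a == b ? 0 : 1;
}

/* Smaller of two ints. */
static int min2(int a, int b)
{
    return a < b ? a : b;
}

/* Value of M[i][j] given its three already computed neighbours:
 * diag = M[i-1][j-1], up = M[i-1][j], left = M[i][j-1].
 * c1 is S1[i] and c2 is S2[j]. Insert and delete each cost 1. */
int cell_value(int diag, int up, int left, char c1, char c2)
{
    assert(diag >= 0 && up >= 0 && left >= 0);
    int by_align  = diag + align_cost(c1, c2);  /* case 1: diagonal */
    int by_delete = up + 1;                     /* case 2: from the top */
    int by_insert = left + 1;                   /* case 3: from the left */
    return min2(by_align, min2(by_delete, by_insert));
}
```

For example, the first cell of the table (T aligned with T, neighbours 0, 1 and 1) evaluates to 0.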

Page 15: Edit Distance: Case 1

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Cell (i, j) and its diagonal neighbor are highlighted.]

Case 1 (diagonal): M[i, j] = M[i-1, j-1] + cost of align S1[i] and S2[j]

Example for cell (4, 5): optimal cost of matching TGA from S1 with TCGA from S2, plus aligning C with C.

Page 16: Edit Distance: Case 2

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Cell (i, j) and its top neighbor are highlighted.]

Case 2 (from the top): M[i, j] = M[i-1, j] + cost of delete S1[i]

Example for cell (4, 5): optimal cost of matching TGA from S1 with TCGAC from S2, plus deleting C from S1.

Page 17: Edit Distance: Case 3

[Figure: DP table M; rows labeled by S1 = TGACGTGC, columns labeled by S2 = TCGACGTCA. Cell (i, j) and its left neighbor are highlighted.]

Case 3 (from the left): M[i, j] = M[i, j-1] + cost of insert S2[j] into S1

Example for cell (4, 5): optimal cost of matching TGAC from S1 with TCGA from S2, plus inserting C into S1.

Page 18: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Cell M[1, 1] (T aligned with T):
• Case 1: 0 + 0 = 0
• Case 2: 1 + 1 = 2
• Case 3: 1 + 1 = 2

M[1, 1] = 0

Page 19: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Cell M[1, 2] (T aligned with C):
• Case 1: 1 + 1 = 2
• Case 2: 2 + 1 = 3
• Case 3: 0 + 1 = 1

M[1, 2] = 1. Row 1 so far: 0 1

Page 20: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Row 1 (T), columns 1 to 9: 0 1 2 3 4 5 6 7 8

Page 21: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Row 1 (T), columns 1 to 9: 0 1 2 3 4 5 6 7 8
Row 2 (G) so far: 1

Page 22: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Row 1 (T), columns 1 to 9: 0 1 2 3 4 5 6 7 8
Row 2 (G), columns 1 to 9: 1 1 1 2 3 4 5 6 7

Page 23: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Row 1 (T), columns 1 to 9: 0 1 2 3 4 5 6 7 8
Row 2 (G), columns 1 to 9: 1 1 1 2 3 4 5 6 7

Cell M[2, 6] (G aligned with G):
• Case 1: 4 + 0 = 4
• Case 2: 5 + 1 = 6
• Case 3: 3 + 1 = 4

Two equivalent options (case 1 and case 3) to reach this cell.

Page 24: Edit Distance: Complete Example

[Figure: DP table M for S1 = TGACGTGC (rows) and S2 = TCGACGTCA (columns), filled row by row.]

Row 1 (T), columns 1 to 9: 0 1 2 3 4 5 6 7 8
Row 2 (G), columns 1 to 9: 1 1 1 2 3 4 5 6 7
Row 3 (A), columns 1 to 9: 2 2 2 1 2 3 4 5 6

Page 25: Edit Distance: Complete Example

Completed table M (rows: S1 = TGACGTGC, columns: S2 = TCGACGTCA; row 0: S1 is empty, column 0: S2 is empty):

         T  C  G  A  C  G  T  C  A
      0  1  2  3  4  5  6  7  8  9
   T  1  0  1  2  3  4  5  6  7  8
   G  2  1  1  1  2  3  4  5  6  7
   A  3  2  2  2  1  2  3  4  5  6
   C  4  3  2  3  2  1  2  3  4  5
   G  5  4  3  2  3  2  1  2  3  4
   T  6  5  4  3  3  3  2  1  2  3
   G  7  6  5  4  4  4  3  2  2  3
   C  8  7  6  5  5  4  4  3  2  3

Final answer: M[8, 9] = 3 (to convert S1 into S2 we need 3 edit operations)
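The complete example can be reproduced with a short C function. This is a sketch that assumes the unit costs from the slides; the names `fill_table`, `min3` and the bound `MAX_LEN` are hypothetical, and the fixed-size table is a simplification for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN 32   /* assumed bound on string length for this sketch */

/* Smallest of three ints. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fill M so that M[i][j] is the cost of converting the first i characters
 * of s1 into the first j characters of s2 (insert = delete = 1,
 * align = 0 for equal characters and 1 otherwise). Returns M[n][m]. */
int fill_table(const char *s1, const char *s2,
               int M[MAX_LEN + 1][MAX_LEN + 1])
{
    size_t n = strlen(s1), m = strlen(s2);
    assert(n <= MAX_LEN && m <= MAX_LEN);

    /* Initialization phase: row 0 and column 0. */
    for (size_t j = 0; j <= m; j++) M[0][j] = (int)j;
    for (size_t i = 0; i <= n; i++) M[i][0] = (int)i;

    /* Fill the remaining cells row by row. */
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int align = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            M[i][j] = min3(M[i - 1][j - 1] + align,   /* case 1 */
                           M[i - 1][j] + 1,           /* case 2 */
                           M[i][j - 1] + 1);          /* case 3 */
        }
    }
    return M[n][m];
}
```

Calling `fill_table("TGACGTGC", "TCGACGTCA", M)` returns 3, the value in the bottom-right cell M[8][9].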

Page 26: Summary of Steps

>> We consider all combinations (all possible alignments), that is, we navigate the solution space

>> We start with small sub-problems and solve them optimally (optimal sub-structure)

>> At each step, for a problem of size K, we use the results of the possible sub-problems of size K-1 to find the best answer (we need to keep these results, not compute them again)
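The point about keeping results instead of recomputing them can be illustrated with a top-down variant that stores each sub-problem the first time it is solved. The slides fill the table bottom-up; this memoized recursion is a different but equivalent formulation, shown only as a sketch. The names `solve`, `memo`, `edit_distance_memo` and the bound `MAX_LEN` are assumptions.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN 32   /* assumed bound on string length for this sketch */

/* memo[i][j] holds the cost for prefixes of length i and j, or -1. */
static int memo[MAX_LEN + 1][MAX_LEN + 1];

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Cost of converting the first i characters of s1 into the first j
 * characters of s2, computed top-down with stored results. */
static int solve(const char *s1, const char *s2, int i, int j)
{
    if (i == 0) return j;                     /* only insertions remain */
    if (j == 0) return i;                     /* only deletions remain */
    if (memo[i][j] >= 0) return memo[i][j];   /* reuse a stored result */

    int align = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
    int best = min3(solve(s1, s2, i - 1, j - 1) + align,
                    solve(s1, s2, i - 1, j) + 1,
                    solve(s1, s2, i, j - 1) + 1);
    memo[i][j] = best;
    return best;
}

int edit_distance_memo(const char *s1, const char *s2)
{
    size_t n = strlen(s1), m = strlen(s2);
    assert(n <= MAX_LEN && m <= MAX_LEN);
    memset(memo, -1, sizeof memo);   /* all bytes 0xFF, so every int reads as -1 */
    return solve(s1, s2, (int)n, (int)m);
}
```

Without the `memo` lookup the same sub-problems would be solved repeatedly; with it, each of the (n+1)(m+1) cells is computed at most once.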

Page 27: Edit Distance: Algorithm

#include <string.h>

static int min(int a, int b) { return a < b ? a : b; }

// S1 of size n, S2 of size m
int edit_distance(const char *S1, const char *S2) {
    int n = (int)strlen(S1), m = (int)strlen(S2);
    int matrix[n + 1][m + 1];

    // Initialization step
    for (int x = 0; x <= n; x++) matrix[x][0] = x;
    for (int y = 1; y <= m; y++) matrix[0][y] = y;

    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++)
            if (S1[x - 1] == S2[y - 1])   // C strings are 0-indexed
                // If matching, then go diagonal with 0 additional cost
                matrix[x][y] = matrix[x - 1][y - 1];
            else
                // Otherwise consider all three options and take the least
                matrix[x][y] = 1 + min(matrix[x - 1][y - 1],   // align different characters
                                   min(matrix[x][y - 1],       // insert
                                       matrix[x - 1][y]));     // delete

    return matrix[n][m];
}

Page 28: Edit Distance: Algorithm Analysis

>> We compute (n × m) cells

>> For each cell we look at no more than 3 neighboring cells

Time complexity is O(nm)

Space complexity is also O(nm)
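If only the final cost is needed and no backtracking is required, each row depends only on the row above it, so two rows of the table are enough. The following is a sketch of that variant, not something stated in the slides; the function name `edit_distance_two_rows` and the convention of returning -1 on allocation failure are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Edit distance with unit costs, keeping only two rows of the table.
 * Time is still O(nm); extra space is O(m). Returns -1 if memory
 * cannot be allocated. */
int edit_distance_two_rows(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Row 0: S1 is empty. */
    for (size_t j = 0; j <= m; j++) prev[j] = (int)j;

    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i;                       /* column 0: S2 is empty */
        for (size_t j = 1; j <= m; j++) {
            int align = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            int best = prev[j - 1] + align;                      /* diagonal */
            if (prev[j] + 1 < best) best = prev[j] + 1;          /* top */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;  /* left */
            curr[j] = best;
        }
        int *tmp = prev;   /* the row just filled becomes the previous row */
        prev = curr;
        curr = tmp;
    }

    int result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

The trade-off is that the full table is no longer available, so the sequence of operations cannot be recovered from this version.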

Page 29: How to Backtrack

We now know that the cost is 3. What are the operations, and in what order?

• Keep extra information with each cell c
  – From where you arrived at c (diagonal, left, or top)

In dynamic programming, to backtrack you generally need to record which optimal sub-problem you used at each step.

Page 30: Backtrack

Table M (rows: S1 = TGACGTGC, columns: S2 = TCGACGTCA; row 0: S1 is empty, column 0: S2 is empty):

         T  C  G  A  C  G  T  C  A
      0  1  2  3  4  5  6  7  8  9
   T  1  0  1  2  3  4  5  6  7  8
   G  2  1  1  1  2  3  4  5  6  7
   A  3  2  2  2  1  2  3  4  5  6
   C  4  3  2  3  2  1  2  3  4  5
   G  5  4  3  2  3  2  1  2  3  4
   T  6  5  4  3  3  3  2  1  2  3
   G  7  6  5  4  4  4  3  2  2  3
   C  8  7  6  5  5  4  4  3  2  3

[Figure legend: arrows trace the path from M[8, 9] back to M[0, 0]. A diagonal step means align, a step from the left means insert, and a step from the top means delete.]

Operations on S1:

  S2:           T C G A C G T - C A
  Original S1:  T - G A C G T G C -

• Insert C (position 2)
• Delete G (position 7)
• Insert A (position 9)

(Insert positions are counted in S2; the delete position is counted in the original S1.)
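Backtracking can be sketched in C as follows. Instead of storing an arrow in every cell, this version re-derives the direction from the filled table, which gives the same information; that substitution, the function name `edit_script`, the bound `MAX_LEN`, and the output encoding ('a' = align, 'i' = insert, 'd' = delete) are all assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN 32   /* assumed bound on string length for this sketch */

/* Fill the cost table (unit costs), then walk back from M[n][m] to M[0][0].
 * Writes one character per step into ops, in forward order:
 *   'a' = align, 'i' = insert, 'd' = delete.
 * ops must have room for n + m + 1 characters. Returns the distance. */
int edit_script(const char *s1, const char *s2, char *ops)
{
    static int M[MAX_LEN + 1][MAX_LEN + 1];
    size_t n = strlen(s1), m = strlen(s2);
    assert(n <= MAX_LEN && m <= MAX_LEN);

    for (size_t j = 0; j <= m; j++) M[0][j] = (int)j;
    for (size_t i = 0; i <= n; i++) M[i][0] = (int)i;
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]);
            if (M[i - 1][j] + 1 < best) best = M[i - 1][j] + 1;
            if (M[i][j - 1] + 1 < best) best = M[i][j - 1] + 1;
            M[i][j] = best;
        }

    /* Backtrack: collect the operations in reverse order. */
    char rev[2 * MAX_LEN + 1];
    size_t len = 0, i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            M[i][j] == M[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])) {
            rev[len++] = 'a'; i--; j--;   /* arrived from the diagonal */
        } else if (i > 0 && M[i][j] == M[i - 1][j] + 1) {
            rev[len++] = 'd'; i--;        /* arrived from the top */
        } else {
            rev[len++] = 'i'; j--;        /* arrived from the left */
        }
    }

    /* Reverse into the caller's buffer. */
    for (size_t k = 0; k < len; k++) ops[k] = rev[len - 1 - k];
    ops[len] = '\0';
    return M[n][m];
}
```

When several optimal paths exist, this sketch prefers the diagonal step, so it may report a different set of operations than the slide, with the same total cost.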