Dynamic Programming (Edit Distance)
Edit Distance
• Input:
  – Two input strings S1 (of size n) and S2 (of size m)
    • E.g., S1 = ATTTCTAGTGGGTAAA
    • S2 = ATCTAGTTTAGGGATA
• Target:
  – Find the smallest distance between S1 and S2
  – In other words, the smallest number of edit operations to convert S1 into S2
• Edit Operations
  – Insert (i), Delete (d), Align (a)
Example
• S1: TCGACGTCA
• S2: TGACGTGC
• Three operations to convert S1 to S2 (positions are columns of the alignment):
    S1: TCGACGT-CA
    S2: T-GACGTGC-
  – Delete C (position 2) and A (position 10)
  – Insert G (position 8)
Edit Distance
• S1 = TGACGTGC labels the rows, S2 = TCGACGTCA labels the columns
• Row 0 means S1 is empty; column 0 means S2 is empty
• Edit operations are applied to S1 to convert it into S2: cost of insert i, cost of delete d, cost of align a
• Row 0 holds the cost of inserting a prefix of S2 into S1 to match S2 (inserting T costs i, inserting TC costs 2i, and so on):
    0  i  2i  3i  4i  5i  6i  7i  8i  9i
• Column 0 holds the cost of deleting a prefix of S1 to match S2 (deleting T costs d, deleting TG costs 2d, and so on):
    0  d  2d  3d  4d  5d  6d  7d  8d
What we did so far is called the Initialization Phase:
    M[0][j] = j * cost of insert (for all j)
    M[k][0] = k * cost of delete (for all k)
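The initialization phase can be sketched in C as follows. This is only an illustration: the function name init_table, the flat row-by-row storage, and the parameter names ins_cost and del_cost (standing in for i and d) are assumptions, not part of the slides.

```c
/* Fill row 0 and column 0 of an (n+1) x (m+1) table stored row by row.
   ins_cost and del_cost play the role of i and d on the slides.
   All names here are illustrative. */
void init_table(int *M, int n, int m, int ins_cost, int del_cost) {
    int cols = m + 1;
    for (int j = 0; j <= m; j++)
        M[0 * cols + j] = j * ins_cost;   /* S1 is empty: insert S2[1..j] */
    for (int k = 0; k <= n; k++)
        M[k * cols + 0] = k * del_cost;   /* S2 is empty: delete S1[1..k] */
}
```

For the strings used here (n = 8, m = 9) the caller would pass an array of 9 * 10 ints; every other cell is left untouched for the fill phase.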
Edit Distance
For simplicity, let's assume the following costs:
• Cost of insert (i) = 1
• Cost of delete (d) = 1
• Cost of align (a) = 0 if the aligned characters are the same, 1 if the aligned characters are different
With these costs, row 0 becomes 0 1 2 3 4 5 6 7 8 9 and column 0 becomes 0 1 2 3 4 5 6 7 8.
Edit Distance
• Cell (i, j) holds M[i, j]: the smallest cost of converting S1[1..i] to match S2[1..j]
• Cell (n, m) is our goal: convert S1[1..n] to match S2[1..m]
Edit Distance
M[i, j] = Min of:
    M[i-1, j-1] + cost of align S1[i] and S2[j]
    M[i-1, j] + cost of delete S1[i]
    M[i, j-1] + cost of insert S2[j] into S1
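A minimal sketch of this rule for a single cell, assuming the simplified unit costs (insert = 1, delete = 1, align = 0 or 1). The helper names align_cost, min3 and cell_value are made up for illustration.

```c
/* Cost of aligning two characters under the simplified costs:
   0 if they are the same, 1 if they are different. */
int align_cost(char a, char b) { return a == b ? 0 : 1; }

int min3(int a, int b, int c) {
    int best = a;
    if (b < best) best = b;
    if (c < best) best = c;
    return best;
}

/* diag = M[i-1][j-1], top = M[i-1][j], left = M[i][j-1].
   s1_i and s2_j are the characters S1[i] and S2[j]. */
int cell_value(int diag, int top, int left, char s1_i, char s2_j) {
    int case1 = diag + align_cost(s1_i, s2_j);  /* align S1[i] with S2[j] */
    int case2 = top + 1;                        /* delete S1[i] */
    int case3 = left + 1;                       /* insert S2[j] into S1 */
    return min3(case1, case2, case3);
}
```

For example, with diag = 0, top = 1, left = 1 and two equal characters, the three cases are 0, 2 and 2, so the cell gets 0.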
Edit Distance: Case 1
M[i-1, j-1] + cost of align S1[i] and S2[j]
• E.g., for TGAC from S1 and TCGAC from S2: optimal cost of matching TGA from S1 with TCGA from S2 + align C with C
Edit Distance: Case 2
M[i-1, j] + cost of delete S1[i]
• E.g., for TGAC from S1 and TCGAC from S2: optimal cost of matching TGA from S1 with TCGAC from S2 + delete C from S1
Edit Distance: Case 3
M[i, j-1] + cost of insert S2[j] into S1
• E.g., for TGAC from S1 and TCGAC from S2: optimal cost of matching TGAC from S1 with TCGA from S2 + insert C into S1
Edit Distance: Complete Example
M[1, 1] (T from S1, T from S2):
    Case 1: 0 + 0 = 0
    Case 2: 1 + 1 = 2
    Case 3: 1 + 1 = 2
    Min = 0
M[1, 2] (T from S1, C from S2):
    Case 1: 1 + 1 = 2
    Case 2: 2 + 1 = 3
    Case 3: 0 + 1 = 1
    Min = 1
Filling the table row by row (the value before the bar is column 0):
    Row 1 (T): 1 | 0 1 2 3 4 5 6 7 8
    Row 2 (G): 2 | 1 1 1 2 3 4 5 6 7
E.g., M[2, 6] (G from S1, G from S2):
    Case 1: 4 + 0 = 4
    Case 2: 5 + 1 = 6
    Case 3: 3 + 1 = 4
Two equivalent options to reach this cell (Case 1 and Case 3)
Edit Distance: Complete Example
M[i, j] (rows: S1 = TGACGTGC, columns: S2 = TCGACGTCA):
            T  C  G  A  C  G  T  C  A
        0   1  2  3  4  5  6  7  8  9
    T   1   0  1  2  3  4  5  6  7  8
    G   2   1  1  1  2  3  4  5  6  7
    A   3   2  2  2  1  2  3  4  5  6
    C   4   3  2  3  2  1  2  3  4  5
    G   5   4  3  2  3  2  1  2  3  4
    T   6   5  4  3  3  3  2  1  2  3
    G   7   6  5  4  4  4  3  2  2  3
    C   8   7  6  5  5  4  4  3  2  3
Final answer: M[8, 9] = 3 (to convert S1 into S2 we need 3 edit operations)
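The complete example can be re-checked with a short C sketch that fills the whole table under the simplified unit costs. The names fill_table and MAXLEN, and the fixed-size table, are assumptions chosen for illustration only.

```c
#include <string.h>

#define MAXLEN 32   /* illustrative upper bound on the string lengths */

/* Fills M[0..n][0..m] with insert = 1, delete = 1, align = 0 or 1,
   and returns M[n][m]. Returns -1 if a string is longer than MAXLEN. */
int fill_table(const char *s1, const char *s2, int M[MAXLEN + 1][MAXLEN + 1]) {
    int n = (int)strlen(s1), m = (int)strlen(s2);
    if (n > MAXLEN || m > MAXLEN) return -1;

    /* Initialization phase: row 0 and column 0. */
    for (int j = 0; j <= m; j++) M[0][j] = j;
    for (int i = 0; i <= n; i++) M[i][0] = i;

    /* Fill phase: each cell is the minimum of the three cases. */
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1);
            if (M[i - 1][j] + 1 < best) best = M[i - 1][j] + 1;   /* delete */
            if (M[i][j - 1] + 1 < best) best = M[i][j - 1] + 1;   /* insert */
            M[i][j] = best;
        }
    }
    return M[n][m];
}
```

Calling fill_table("TGACGTGC", "TCGACGTCA", M) on a table declared as int M[MAXLEN + 1][MAXLEN + 1] should return 3, the final answer of the example.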
Summary of Steps
>> We consider all combinations (all possible alignments) (navigate the solution space)
>> We start with small sub-problems and solve them optimally (optimal sub-structure)
>> At each step, for a problem of size K, use the results from the possible K-1 sub-problems to find your best answer (need to keep these results, not compute them again)
Edit Distance: Algorithm
static int min(int a, int b) { return a < b ? a : b; }

/* S1 of size n, S2 of size m */
int edit_distance(const char *S1, int n, const char *S2, int m)
{
    int matrix[n+1][m+1];
    int x, y;

    /* Initialization step */
    for (x = 0; x <= n; x++) matrix[x][0] = x;
    for (y = 1; y <= m; y++) matrix[0][y] = y;

    for (x = 1; x <= n; x++)
        for (y = 1; y <= m; y++)
            if (S1[x-1] == S2[y-1])
                /* If matching, then go diagonal with 0 additional cost */
                matrix[x][y] = matrix[x-1][y-1];
            else
                /* Consider the three options (align, insert, delete) and take the least */
                matrix[x][y] = min(matrix[x-1][y-1] + 1,
                               min(matrix[x][y-1] + 1,
                                   matrix[x-1][y] + 1));
    return matrix[n][m];
}
Edit Distance: Algorithm Analysis
>> We compute (n × m) cells
>> For each cell we compare with at most 3 surrounding cells
Time Complexity is O(nm)
Space Complexity is also O(nm)
How to Backtrack
We now know that the cost is 3. What are the operations, and in what order?
• Keep extra information with each cell c
  – From where did you arrive at c (diagonal, left, or top)
In Dynamic Programming, to backtrack you may always need to keep track of which optimal sub-problem you used at each step
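One way to keep that extra information is a second table that records, for every cell, the direction it was reached from. The sketch below assumes the simplified unit costs; the names edit_script, from and MAXLEN, the letter codes, and the tie-breaking order are all choices made for this illustration.

```c
#include <string.h>

#define MAXLEN 32   /* illustrative upper bound on the string lengths */

/* Returns the edit distance and writes one letter per step into ops:
   'a' align (same characters), 'x' align (different characters),
   'i' insert, 'd' delete.
   ops needs room for strlen(s1) + strlen(s2) + 1 chars.
   Returns -1 if a string is longer than MAXLEN. */
int edit_script(const char *s1, const char *s2, char *ops) {
    static int M[MAXLEN + 1][MAXLEN + 1];
    static char from[MAXLEN + 1][MAXLEN + 1];   /* extra information per cell */
    int n = (int)strlen(s1), m = (int)strlen(s2);
    if (n > MAXLEN || m > MAXLEN) return -1;

    M[0][0] = 0;
    for (int j = 1; j <= m; j++) { M[0][j] = j; from[0][j] = 'i'; }
    for (int i = 1; i <= n; i++) { M[i][0] = i; from[i][0] = 'd'; }

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int same = (s1[i - 1] == s2[j - 1]);
            int diag = M[i - 1][j - 1] + (same ? 0 : 1);
            int left = M[i][j - 1] + 1;
            int top  = M[i - 1][j] + 1;
            int best = diag;
            if (left < best) best = left;
            if (top < best) best = top;
            M[i][j] = best;
            /* When several options tie, this order decides; it is arbitrary. */
            if (same && diag == best) from[i][j] = 'a';
            else if (left == best)    from[i][j] = 'i';
            else if (top == best)     from[i][j] = 'd';
            else                      from[i][j] = 'x';
        }
    }

    /* Walk back from (n, m) to (0, 0), collecting the steps in reverse. */
    int i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        char step = from[i][j];
        ops[len++] = step;
        if (step == 'i') j--;            /* arrived from the left */
        else if (step == 'd') i--;       /* arrived from the top */
        else { i--; j--; }               /* arrived diagonally */
    }

    /* Put the steps in forward order. */
    for (int k = 0; k < len / 2; k++) {
        char t = ops[k];
        ops[k] = ops[len - 1 - k];
        ops[len - 1 - k] = t;
    }
    ops[len] = '\0';
    return M[n][m];
}
```

With S1 = TGACGTGC and S2 = TCGACGTCA this returns 3, and the steps that are not plain aligns are one insert near the start, one delete, and one insert at the end. A different tie-breaking order can produce a different script with the same cost.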
Backtrack
• In the completed matrix M, start at cell (8, 9) and follow each cell back to where it was reached from, until cell (0, 0)
  – A diagonal step means align
  – A step from the left means insert
  – A step from the top means delete
• Path: (0,0) (1,1) (1,2) (2,3) (3,4) (4,5) (5,6) (6,7) (7,7) (8,8) (8,9)
Operations on S1
    Original S1: T-GACGTGC-
    S2:          TCGACGT-CA
• Insert C (position 2)
• Delete G (position 7)
• Insert A (position 9)