CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings...
CS3000: Algorithms & Data
Jonathan Ullman
Lecture 9:
• Dynamic Programming: Edit Distance, RNA Folding
Oct 5, 2018
Office Hours Schedule
[Handwritten schedule; not legible in this transcript]
Edit Distance / Alignments
Distance Between Strings
• Autocorrect works by finding similar strings
• ocurrance and occurrence seem similar, but only if we define similarity carefully

ocurrance
occurrence

oc-urrance
occurrence
Edit Distance / Alignments
• Given two strings $x \in \Sigma^n$, $y \in \Sigma^m$, the edit distance is the minimum number of insertions, deletions, and swaps (replacing one symbol with another) required to turn $x$ into $y$.
• Given an alignment, the cost is the number of positions where the two strings don't agree

o c - u r r a n c e
o c c u r r e n c e

• Allow gaps but not transpositions
• The cost of the alignment is the number of columns where the two symbols don't agree (here 2: one gap and one mismatch, a vs. e)
Ask the Audience
• What is the minimum cost alignment of the strings smitten and sitting?

Edit distance = 3

s m i t t e n -
s - i t t i n g

smitten → sitten → sittin → sitting
Edit Distance / Alignments
• Input: Two strings $x \in \Sigma^n$, $y \in \Sigma^m$
• Output: The minimum cost alignment of $x$ and $y$
• Edit Distance = cost of the minimum cost alignment
Dynamic Programming
• Consider the optimal alignment of $x, y$
• Three choices for the final column
  • Case I: only use $x$: $(x_n, -)$, preceded by an optimal alignment of $x_1 \cdots x_{n-1}$ and $y_1 \cdots y_m$
  • Case II: only use $y$: $(-, y_m)$, preceded by an optimal alignment of $x_1 \cdots x_n$ and $y_1 \cdots y_{m-1}$
  • Case III: use one symbol from each: $(x_n, y_m)$, preceded by an optimal alignment of $x_1 \cdots x_{n-1}$ and $y_1 \cdots y_{m-1}$
• To determine which case is best, solve alignment on a smaller $x, y$
Dynamic Programming
• Consider the optimal alignment of $x, y$
• Case I: only use $x$: $(x_n, -)$
  • deletion + optimal alignment of $x_{1:n-1}, y_{1:m}$
• Case II: only use $y$: $(-, y_m)$
  • insertion + optimal alignment of $x_{1:n}, y_{1:m-1}$
• Case III: use one symbol from each: $(x_n, y_m)$
  • If $x_n = y_m$: optimal alignment of $x_{1:n-1}, y_{1:m-1}$
  • If $x_n \neq y_m$: mismatch + opt. alignment of $x_{1:n-1}, y_{1:m-1}$
Dynamic Programming
• $\mathrm{OPT}(i, j)$ = cost of opt. alignment of $x_{1:i}$ and $y_{1:j}$; there are $(n+1)(m+1)$ subproblems
• Case I: only use $x$: $(x_i, -)$, cost $1 + \mathrm{OPT}(i-1, j)$
• Case II: only use $y$: $(-, y_j)$, cost $1 + \mathrm{OPT}(i, j-1)$
• Case III: use one symbol from each: $(x_i, y_j)$, cost $\mathrm{OPT}(i-1, j-1)$ if $x_i = y_j$, and $1 + \mathrm{OPT}(i-1, j-1)$ if $x_i \neq y_j$

$$\mathrm{OPT}(i,j) = \begin{cases} \min\{1 + \mathrm{OPT}(i-1,j),\ 1 + \mathrm{OPT}(i,j-1),\ \mathrm{OPT}(i-1,j-1)\} & \text{if } x_i = y_j \\ \min\{1 + \mathrm{OPT}(i-1,j),\ 1 + \mathrm{OPT}(i,j-1),\ 1 + \mathrm{OPT}(i-1,j-1)\} & \text{if } x_i \neq y_j \end{cases}$$
Dynamic Programming
• $\mathrm{OPT}(i, j)$ = cost of opt. alignment of $x_{1:i}$ and $y_{1:j}$
• Case I: only use $x$: $(x_i, -)$
• Case II: only use $y$: $(-, y_j)$
• Case III: use one symbol from each: $(x_i, y_j)$

Recurrence:
$$\mathrm{OPT}(i,j) = \begin{cases} 1 + \min\{\mathrm{OPT}(i-1,j),\ \mathrm{OPT}(i,j-1),\ \mathrm{OPT}(i-1,j-1)\} & \text{if } x_i \neq y_j \\ \min\{1 + \mathrm{OPT}(i-1,j),\ 1 + \mathrm{OPT}(i,j-1),\ \mathrm{OPT}(i-1,j-1)\} & \text{if } x_i = y_j \end{cases}$$
Base Cases: $\mathrm{OPT}(i, 0) = i$, $\mathrm{OPT}(0, j) = j$
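The recurrence and base cases can be transcribed almost directly into a memoized recursive function. The following is only a sketch: the names `edit_distance` and `opt` are illustrative choices, not from the lecture, and it assumes unit costs for gaps and mismatches.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Cost of a minimum-cost alignment of x and y (unit gap/mismatch costs)."""

    @lru_cache(maxsize=None)
    def opt(i: int, j: int) -> int:
        # opt(i, j) = cost of the optimal alignment of x[:i] and y[:j].
        if i == 0:
            return j  # base case: OPT(0, j) = j
        if j == 0:
            return i  # base case: OPT(i, 0) = i
        mismatch = 0 if x[i - 1] == y[j - 1] else 1
        return min(
            1 + opt(i - 1, j),             # Case I: final column (x_i, -)
            1 + opt(i, j - 1),             # Case II: final column (-, y_j)
            mismatch + opt(i - 1, j - 1),  # Case III: final column (x_i, y_j)
        )

    return opt(len(x), len(y))


if __name__ == "__main__":
    print(edit_distance("smitten", "sitting"))
```

The recursion depth grows with the combined length of the two strings, so for long inputs a table-filling version is likely the safer choice.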
Example
x = pert, y = beast

| | - | b | e | a | s | t |
|---|---|---|---|---|---|---|
| - | 0 | 1 | 2 | 3 | 4 | 5 |
| p | 1 | 1 | 2 | 3 | 4 | 5 |
| e | 2 | 2 | 1 | 2 | 3 | 4 |
| r | 3 | 3 | 2 | 2 | 3 | 4 |
| t | 4 | 4 | 3 | 3 | 3 | 3 |
Finding the Alignment
• $\mathrm{OPT}(i, j)$ = cost of opt. alignment of $x_{1:i}$ and $y_{1:j}$
• Case I: only use $x$: $(x_i, -)$
• Case II: only use $y$: $(-, y_j)$
• Case III: use one symbol from each: $(x_i, y_j)$
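One common way to recover an alignment, sketched here under the assumption that the full table of OPT values is kept, is to walk back from the last cell and ask at each step which of the three cases attains the stored value. The function name `align` and the use of `-` as the gap symbol are illustrative assumptions; ties between cases are broken arbitrarily, so this returns one optimal alignment, not all of them.

```python
def align(x: str, y: str) -> tuple:
    """Return (cost, top, bottom): one minimum-cost alignment of x and y."""
    n, m = len(x), len(y)
    # M[i][j] = OPT(i, j), filled with the unit-cost recurrence.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = i
    for j in range(m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            M[i][j] = min(1 + M[i - 1][j], 1 + M[i][j - 1], diff + M[i - 1][j - 1])

    # Trace back from (n, m), collecting columns from last to first.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diff = 0 if i > 0 and j > 0 and x[i - 1] == y[j - 1] else 1
        if i > 0 and j > 0 and M[i][j] == diff + M[i - 1][j - 1]:
            top.append(x[i - 1])       # Case III: column (x_i, y_j)
            bottom.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and M[i][j] == 1 + M[i - 1][j]:
            top.append(x[i - 1])       # Case I: column (x_i, -)
            bottom.append("-")
            i -= 1
        else:
            top.append("-")            # Case II: column (-, y_j)
            bottom.append(y[j - 1])
            j -= 1
    return M[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    cost, top, bottom = align("smitten", "sitting")
    print(cost)
    print(top)
    print(bottom)
```

The traceback visits at most n + m cells, so it adds little on top of filling the table.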
Edit Distance (“Bottom-Up”)
// All inputs are global vars
FindOPT(n,m):
  // base cases, for all j = 0,…,m and i = 0,…,n
  M[0,j] ← j, M[i,0] ← i
  for (i = 1,…,n):
    for (j = 1,…,m):
      if (xi = yj):
        M[i,j] = min{1+M[i-1,j], 1+M[i,j-1], M[i-1,j-1]}
      elseif (xi != yj):
        M[i,j] = 1+min{M[i-1,j], M[i,j-1], M[i-1,j-1]}
  return M[n,m]

• O(nm) entries × O(1) per entry = O(nm) time
• O(nm) space just for the table
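As a rough Python rendering of the FindOPT pseudocode, the sketch below takes the strings as arguments (instead of global variables) and returns the whole table so it can be inspected. The name `find_opt_table` is an illustrative assumption.

```python
def find_opt_table(x: str, y: str) -> list:
    """Fill M[i][j] = OPT(i, j) row by row and return the full table."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        M[0][j] = j  # base case OPT(0, j) = j
    for i in range(n + 1):
        M[i][0] = i  # base case OPT(i, 0) = i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                M[i][j] = min(1 + M[i - 1][j], 1 + M[i][j - 1], M[i - 1][j - 1])
            else:
                M[i][j] = 1 + min(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M


if __name__ == "__main__":
    # Rows correspond to "-", p, e, r, t; columns to "-", b, e, a, s, t.
    for row in find_opt_table("pert", "beast"):
        print(row)
```

The bottom-right entry of the returned table is the edit distance; for x = pert and y = beast it is 3.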
Ask the Audience
• Suppose inserting/deleting costs $\delta > 0$ and swapping $a \leftrightarrow b$ costs $c_{a,b} > 0$
• Write a recurrence for the min-cost alignment
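One possible answer, offered as a sketch: write $\delta$ for the insertion/deletion cost and $c_{a,b}$ for the cost of swapping $a$ with $b$ (symbol names assumed here), and additionally assume that aligning a symbol with itself is free, i.e. $c_{a,a} = 0$. The three cases for the final column then carry these costs in place of the unit costs:

$$\mathrm{OPT}(i,j) = \min\{\ \delta + \mathrm{OPT}(i-1,j),\ \ \delta + \mathrm{OPT}(i,j-1),\ \ c_{x_i,y_j} + \mathrm{OPT}(i-1,j-1)\ \}$$

$$\mathrm{OPT}(i,0) = i\,\delta, \qquad \mathrm{OPT}(0,j) = j\,\delta$$

Setting $\delta = 1$ and $c_{a,b} = 1$ for $a \neq b$ recovers the unit-cost recurrence.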
Edit Distance Summary
• Compute the edit distance, or min-cost alignment between two strings in time/space $O(nm)$
• Dynamic Programming:
  • Decide the final pair of symbols in the alignment
• Space can be prohibitive in practice
  • Compute edit distance in space $O(\min\{n, m\})$
  • Can also find alignment in space $O(n + m)$ using a clever divide-and-conquer algorithm!
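The reduced-space bound for the distance alone follows because each row of the table depends only on the row before it. A minimal sketch, assuming unit costs (so the distance is symmetric and the two strings can be swapped to make the shorter one index the columns); the name `edit_distance_low_space` is illustrative:

```python
def edit_distance_low_space(x: str, y: str) -> int:
    """Edit distance using two rows of length min(n, m) + 1."""
    # With unit costs the distance is symmetric, so let y be the shorter string.
    if len(y) > len(x):
        x, y = y, x
    prev = list(range(len(y) + 1))      # row 0: OPT(0, j) = j
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)       # OPT(i, 0) = i
        for j in range(1, len(y) + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(1 + prev[j], 1 + curr[j - 1], diff + prev[j - 1])
        prev = curr                     # keep only the most recent row
    return prev[len(y)]
```

This version discards the information needed for a traceback, so it returns only the cost, not the alignment itself.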
RNA Folding
DNA
• DNA is a string of four bases {A, C, G, T}
• Two complementary strands of DNA stick together and form a double helix
• A-T and C-G are complementary pairs
RNA Folding
• RNA is a string of four bases {A, C, G, U}
• A single RNA strand sticks to itself and folds into complex structures
• A-U and C-G are complementary pairs
RNA Folding
• RNA strand will try to minimize energy (form the most bonds) subject to constraints

[Figure: RNA folding sketches with handwritten labels "crossing", "not complementary", "sharp turn"]
RNA Folding
• RNA is a string of bases $b_1, \ldots, b_n \in \{A, C, G, U\}$
• The structure is given by a set of bonds $S$ consisting of pairs $(i, j)$ with $i < j$
  • (Complements) Only $A - U$ or $C - G$ can be paired
  • (Matching) No base $b_i$ is in two pairs in $S$
  • (No Sharp Turns) If $(i, j) \in S$, then $i < j - 4$
  • (Non-Crossing) If $(i, j), (k, \ell) \in S$ then it cannot be the case that $i < k < j < \ell$
RNA Folding
• Input: RNA sequence $b_1, \ldots, b_n \in \{A, C, G, U\}$
• Output: A set of pairs $S \subseteq \{1, \ldots, n\} \times \{1, \ldots, n\}$
• Goal: maximize the size of $S$
  • (Complements) Only $A - U$ or $C - G$ can be paired
  • (Matching) No base $b_i$ is in two pairs in $S$
  • (No Sharp Turns) If $(i, j) \in S$, then $i < j - 4$
  • (Non-Crossing) If $(i, j), (k, \ell) \in S$ then it cannot be the case that $i < k < j < \ell$
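To make the four constraints concrete, here is a small checker for a candidate set of pairs. It is a sketch only: the name `is_valid_structure`, the 1-indexed pair convention, and the example sequences in the usage lines are illustrative assumptions.

```python
COMPLEMENTS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def is_valid_structure(b: str, pairs) -> bool:
    """Check 1-indexed pairs (i, j) with i < j against the four constraints."""
    pairs = list(pairs)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(b)):
            return False
        if (b[i - 1], b[j - 1]) not in COMPLEMENTS:  # Complements
            return False
        if not i < j - 4:                            # No Sharp Turns
            return False
        if i in used or j in used:                   # Matching
            return False
        used.update((i, j))
    for i, j in pairs:                               # Non-Crossing
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True


if __name__ == "__main__":
    print(is_valid_structure("ACCGGUAGU", [(1, 9), (3, 8)]))   # nested pairs
    print(is_valid_structure("ACAAAAUAAAG", [(1, 7), (2, 11)]))  # crossing pairs
```

The non-crossing check compares every pair against every other pair, which is quadratic in the number of pairs but keeps the code close to the definition.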
Dynamic Programming
• Let $O$ be the optimal set of pairs for $b_1 \cdots b_n$
• Case 1: $n$ pairs with nothing in $O$
  • Then $O$ is the opt. set of pairs for $b_1 \cdots b_{n-1}$
• Case 2: $n$ pairs with some $t < n - 4$ in $O$
  • Then $O$ is $\{(t, n)\}$ + opt. set of pairs for $b_1 \cdots b_{t-1}$ + opt. set of pairs for $b_{t+1} \cdots b_{n-1}$
  • The two pieces can be solved separately because of non-crossing: no pair can have one end before $t$ and the other between $t$ and $n$
Dynamic Programming
• Let $O_{i,j}$ be the optimal set of pairs for $b_i \cdots b_j$
• Case 1: $j$ pairs with nothing in $O_{i,j}$
• Case 2: $j$ pairs with some $t < j - 4$ in $O_{i,j}$
Dynamic Programming
• Let $\mathrm{OPT}(i, j)$ be the opt. number of pairs for $b_i \cdots b_j$
• Case 1: $j$ pairs with nothing in $O_{i,j}$
  • $\mathrm{OPT}(i, j) = \mathrm{OPT}(i, j-1)$
• Case $2_t$: $j$ pairs with $t < j - 4$ in $O_{i,j}$
  • $\mathrm{OPT}(i, j) = 1 + \mathrm{OPT}(i, t-1) + \mathrm{OPT}(t+1, j-1)$
  • Consider any $t$ s.t. $i \le t < j - 4$ and $b_t, b_j$ are complements
Dynamic Programming
• Let $\mathrm{OPT}(i, j)$ be the opt. number of pairs for $b_i \cdots b_j$
• Case 1: $j$ pairs with nothing in $O_{i,j}$
• Case $2_t$: $j$ pairs with $t < j - 4$ in $O_{i,j}$

Recurrence:
$$\mathrm{OPT}(i,j) = \max\Big\{ \mathrm{OPT}(i,j-1),\ \max_{t} \big\{ 1 + \mathrm{OPT}(i,t-1) + \mathrm{OPT}(t+1,j-1) \big\} \Big\}$$
Maximum over all $t$ such that
• $i \le t < j - 4$
• $b_t, b_j$ are compatible bases

Base Cases: $\mathrm{OPT}(i, j) = 0$ if $i \ge j - 4$
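The interval recurrence can be sketched as a memoized recursion over 1-indexed endpoints. The names `max_pairs` and `opt` are illustrative, and "compatible" is taken to mean the complementary pairs A-U and C-G; each pairing of b_t with b_j contributes one bond, hence the `1 +` term.

```python
from functools import lru_cache

COMPLEMENTS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_pairs(b: str) -> int:
    """Maximum number of pairs in a valid folding of the RNA string b."""

    @lru_cache(maxsize=None)
    def opt(i: int, j: int) -> int:
        # opt(i, j) = max number of pairs on b_i ... b_j (1-indexed, inclusive).
        if i >= j - 4:
            return 0  # base case: too short to pair without a sharp turn
        best = opt(i, j - 1)            # Case 1: j pairs with nothing
        for t in range(i, j - 4):       # Case 2_t: i <= t < j - 4
            if (b[t - 1], b[j - 1]) in COMPLEMENTS:
                best = max(best, 1 + opt(i, t - 1) + opt(t + 1, j - 1))
        return best

    return opt(1, len(b))


if __name__ == "__main__":
    print(max_pairs("ACCGGUAGU"))
```

For the sequence ACCGGUAGU this evaluates to 2, attained for example by the pairs (1, 9) and (3, 8).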
Filling the Table
Recurrence:
$$\mathrm{OPT}(i,j) = \max\Big\{ \mathrm{OPT}(i,j-1),\ \max_{i \le t < j-4,\ b_t, b_j \text{ compatible}} \big\{ 1 + \mathrm{OPT}(i,t-1) + \mathrm{OPT}(t+1,j-1) \big\} \Big\}$$

| | j=6 | j=7 | j=8 | j=9 |
|---|---|---|---|---|
| i=4 | 0 | 0 | 0 | |
| i=3 | 0 | 0 | | |
| i=2 | 0 | | | |
| i=1 | | | | |

Sequence: ACCGGUAGU

• At most $n^2$ entries
• $O(n)$ per entry
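One way to fill the table bottom-up, sketched under the assumption that intervals are processed in order of increasing length so that every entry the recurrence needs is already available. The name `rna_table` and the padded 1-indexed layout are illustrative choices.

```python
COMPLEMENTS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def rna_table(b: str) -> list:
    """Return a table T with T[i][j] = OPT(i, j) for 1 <= i, j <= n."""
    n = len(b)
    # Index 0 is padding so that T[i][i - 1] (an empty interval) reads as 0.
    T = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(5, n):              # length = j - i; shorter ones stay 0
        for i in range(1, n - length + 1):
            j = i + length
            best = T[i][j - 1]              # Case 1: j pairs with nothing
            for t in range(i, j - 4):       # Case 2_t: i <= t < j - 4
                if (b[t - 1], b[j - 1]) in COMPLEMENTS:
                    best = max(best, 1 + T[i][t - 1] + T[t + 1][j - 1])
            T[i][j] = best
    return T


if __name__ == "__main__":
    T = rna_table("ACCGGUAGU")
    for i in (4, 3, 2, 1):
        print(i, [T[i][j] for j in (6, 7, 8, 9)])
```

The three nested loops reflect the count above: at most n² entries with O(n) work each, for O(n³) time and O(n²) space.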
RNA Folding Summary
• Compute the optimal RNA folding in time $O(n^3)$ and space $O(n^2)$
• Dynamic Programming:
  • Decide on an optimal pair $b_t - b_n$
  • Remaining RNA is two non-overlapping pieces
  • Adding variables: one subproblem for each interval
• Non-crossing and matching are critical
  • Think about how the dynamic programming algorithm changes if we remove each of the conditions