CSCI 256 Data Structures and Algorithm Analysis Lecture 16 Some slides by Kevin Wayne copyright...
-
Upload
tamsin-mccarthy -
Category
Documents
-
view
216 -
download
0
description
Transcript of CSCI 256 Data Structures and Algorithm Analysis Lecture 16 Some slides by Kevin Wayne copyright...
CSCI 256 Data Structures and Algorithm Analysis Lecture 16
Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some by Iker Gondra
Dynamic Programming Review
• Recipe– Characterize structure of problem– Recursively define value of optimal solution– Compute value of optimal solution– Construct optimal solution from computed information
• Dynamic programming techniques– Binary choice: weighted interval scheduling – Multi-way choice: segmented least squares– Adding a new variable: knapsack– Dynamic programming over intervals: RNA secondary structure
RNA Secondary Structure
• RNA: String B = b1b2bn over alphabet { A, C, G, U }• Secondary structure: RNA is single-stranded so it tends
to loop back and form base pairs with itself. This structure is essential for understanding behavior of molecule
G
U
C
A
GA
A
G
CG
A
UG
A
U
U
A
G
AC A
A
C
U
G
A
G
U
C
AU
C
GG
G
C
C
G
Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA
complementary base pairs: A-U, C-G
RNA Secondary Structure
• Secondary structure: A set of pairs S = { (bi, bj) } that satisfy
– [Watson-Crick] S is a matching and each pair in S is a Watson-Crick complement: A-U, U-A, C-G, or G-C
– [No sharp turns] The ends of each pair are separated by at least 4 intervening bases. If (bi, bj) S, then i < j - 4
– [Non-crossing] If (bi, bj) and (bk, bl) are two pairs in S, then we cannot have i < k < j < l
CG G
C
A
G
U
U
U A
A U G U G G C C A U
ok
G G
C
A
G
U
U A
G
A U G G G C A U
sharp turn
G4
CG G
C
A
U
G
U
U A
A G U U G G C C A U
crossing
RNA Secondary Structure
• Out of all the secondary structures that are possible for a single RNA molecule, which are the ones that are likely to arise?– Free energy: Usual hypothesis is that an RNA
molecule will form the secondary structure with the optimum total free energy
– Goal: Given an RNA molecule B = b1b2bn, find a secondary structure S that maximizes the number of base pairs
approximate by number of base pairs
http://www.genebee.msu.su/services/rna2_reduced.html
RNA Secondary Structure: Subproblems
• First attempt: OPT(j) = maximum number of base pairs in a secondary structure on substring b1b2bj. Either– j is not involved in a pair
• Find optimal secondary structure in: b1b2bj-1
– j pairs with t for some t < j – 4
OPT(j-1)
1 t j
RNA Secondary Structure: Subproblems
If j pairs with some t where t < j – 4, then the no crossover rule tells us that we can’t have a base pair (k,l) where k < t < l < j; this implies we can’t have (k,l) where
1 ≤ k ≤ t -1 and t + 1 ≤ l ≤ j -1This means that any other pair (k,l) in an optimal structure is
either in b1,b2,…,bt-1 or in bt+1,…bj-1
So we must look at two subproblems which are decoupled due to the noncrossing constraint:
• Find the optimal secondary structure in: b1b2bt-1
• Find the optimal secondary structure in: bt+1bt+2bj-1
RNA Secondary Structure: Subproblems
• What is different here????• The second subproblem is not on our list of
subproblems, because it does not begin with b1.• We need more subproblems! • We need to be able to work with subproblems
that do not begin with b1.
Dynamic Programming Over Intervals
• Notation: OPT(i, j) = maximum number of base pairs in a secondary structure of the substring bibi+1bj
– Case 1: i j - 4• OPT(i, j) = 0 by no-sharp turns condition
– Case 2: i < j - 4If Base bj is not involved in a pair (Watson –Crick)
• OPT(i, j) = OPT(i, j-1)
If Base bj pairs with bt for some t, i t < j - 4the non-crossing constraint means that:OPT(i, j) = 1 + max t { OPT(i, t-1) + OPT(t+1, j-1) }
take max over t such that i t < j-4 andbt and bj are Watson-Crick complements
• Hence if i < j – 4 we want the maximum of the two values for Opt(i,j)
• Opt(i,j) = max ( OPT(i, j-1), 1 + max t { OPT(i, t-1) + OPT(t+1, j-1) } )
Dynamic Programming Over Intervals
• What order to solve the sub-problems?– Looking at the recurrence relation, we see that we are
invoking solutions to subproblems on shorter intervals– Need to evaluate Opt for shortest intervals first – this
is different from the subset sum (and knapsack) strategy of doing row by row
– To achieve this need to set an auxillary variable k to a constant and use values of i and j which keep j-i = k
– As k gets larger, the interval for the subproblem bi,bi+1,…bj grows
Dynamic Programming Over Intervals
– Running time: O(n3) Why???
RNA(b1,…,bn) { Initialize Opt[i, j] = 0 whenever i j-4 (ie, i+4 ≥ j) for k = 5, 6, …, n-1 for i = 1, 2, …, n-k set j = i + k Compute Opt[i, j]
return Opt[1, n]}
using recurrence
Running time analysis:
• There are O(n2) subproblems to solve and evaluating the recurrence in each problem takes O(n) time (because we have to find the max over the t’s such that bt and bj are allowable pairs)
• So running time is O(n3)
Example: ACCGGUAGU
• Recall: base pairs allowed: AU, UA, CG, GC• What is the basic array that we need to fill??
(here n= 9)
Example: ACCGGUAGU
• Note if i > j, let Opt (i,j) = 0 (Why??)• Need two dimensions to present the array M of for values for
Opt (i,j) – one for the left endpoint of the interval being considered, and one for the right endpoint
• Some initial values are 0 – whenever i ≥ j – 4 (Why??)• Begin with k = 5; loop over the i’s from 1 to 4 (= 9 – 5)• for Opt (1,6); t = 1 is only t with 1 ≤ t < 6 - 4, and b1b6 is AU
allowable base pair so• Opt (1,6) = max( 0, max(1+0+0) ) = 1• Opt (2,7) t = 2 is only t with 2 ≤ t < 7-4; but b2b7 is CA – not an
allowable base pair so no ts satisfy the conditions and Opt(2,7) = Opt (2,6) = 0
• Next value??
Example: ACCGGUAGU
• Opt(3,8); t = 3 only possible t and b3b8 is allowable base pair so
Opt(3,8)= max( Opt(3,7), max ( 1 + Opt(3,0) + Opt(2,7) ) ) =max ( 0, max(1 + 0+0) ) = 1
Next value to calculate??
Example: ACCGGUAGU
• It is Opt(4,9) • Now let k = 6 and do • Opt( 1, 7) then Opt( 2,8), then Opt (3,9) ….• Note for Opt (1,7) both t = 1 and t = 2 satisfy the
inequality i ≤ t < j – 4 (i = 1 and j = 7); are both base pair allowable?
• This is a fully worked example in the text – check out more values to make sure you are following the algorithm