Bioinformatics Algorithms and Data Structures

UNIVERSITY OF SOUTH CAROLINA

College of Engineering & Information Technology

Chapter 12: Refining Core String Edits and Alignments

Lecturer: Dr. Rose

Slides by: Dr. Rose

February 13, 2003




Homework: due 2/20/03

Chapter 11 questions:

• #1

• #4

• #7

• #8

Additional question for grad students

• #10




Linear Space Alignments

• Dynamic programming takes Θ(nm) space for alignments.

• Can alignments be computed in linear space?

• Hirschberg’s method

– Good news: reduces space from Θ(nm) to O(n), where n < m.

– Bad news: doubles the worst-case time bound.




Linear Space for Similarity

• Recall: similarity is expressed as a single scalar.

• There is an alignment that corresponds to this scalar.

– i.e., the optimal alignment whose value is this scalar.

• We’ve needed the O(nm) table for the alignment.

• Q: If we only want the similarity value, do we need the table?

• A: No. We only need the space required to compute the value.





• Q: How much space is that?

• A: Two rows.

– Recall: to compute cell (i, j) we need cells (i - 1, j - 1), (i - 1, j), and (i, j - 1).

– Cells (i - 1, j - 1) and (i - 1, j) are on the previous row.

– Cells (i, j - 1) and (i, j) are on the current row.

– We only need the current row, C, and the previous row, P.

– When the current row is done, copy it to the previous row for the next iteration, i.e., P ← C.





– After n iterations row C holds the last row n of the full table.

– The similarity value V(n,m) is in the last cell of C.

– The time complexity is still O(nm) but space is now O(m).
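The two-row scheme can be sketched in Python as follows. This is a minimal sketch rather than the lecture's own code: the function name `similarity` and the scoring values (match +1, mismatch -1, space -1) are assumptions chosen for illustration.

```python
def similarity(s1, s2, match=1, mismatch=-1, space=-1):
    """Global similarity V(n, m) of s1 and s2 using only two rows.

    The scoring values are illustrative assumptions, not from the slides.
    """
    m = len(s2)
    # Row 0: the empty prefix of s1 against s2[:j] costs j spaces.
    prev = [j * space for j in range(m + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * space] + [0] * m  # cell (i, 0) costs i spaces
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair,  # from cell (i-1, j-1)
                         prev[j] + space,     # from cell (i-1, j)
                         cur[j - 1] + space)  # from cell (i, j-1)
        prev = cur  # P <- C for the next iteration
    return prev[m]  # last cell of the last row


print(similarity("ACGT", "AGT"))  # 2: three matches and one space
```

Only `prev` and `cur` are alive at any time, so the working space is O(m) while the running time stays O(nm).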





• Q: How can we find the actual alignment in linear space?

• Consider an alignment solution path in the table computation.

[Figure: n × m dynamic programming table with an alignment solution path.]





• Imagine that we knew that the optimal alignment solution path went through cell (n/2, k*).

[Figure: n × m table with row n/2 and column k* marked.]

• Knowing this, we could solve the problem by piecing together solution paths for the diagonal quadrants.





[Figure: n × m table with row n/2 and column k* marked.]

• As important, we could ignore the antidiagonal quadrants.

• We could repeat this process, reducing the amount of space needed to find the optimal alignment.





[Figure: n × m table with rows n/4, n/2, and 3n/4 and column k* marked.]

• Repeating this process would reduce the amount of space needed to find the optimal alignment.

• Q: How far can we go?





• Q: How can we find the cell (n/2, k*)?

• A: Stay tuned!

• Defn. Let α^r denote the reverse of string α.

• Defn. V^r(i, j) is the similarity of the first i characters of S1^r with the first j characters of S2^r.

[Figure: n × m table for S1^r and S2^r, with the prefixes of lengths i and j marked.]





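• An equivalent formulation: V^r(i, j) is the similarity of the last i characters of S1 with the last j characters of S2.

[Figure: n × m table for S1 and S2, with the suffixes that start after positions n - i and m - j marked.]

• V^r(i, j) can be computed in O(nm) time and O(m) space with the same two-row method, applied to the reversed strings.

• Furthermore, any single row of V^r can be computed in O(m) space.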





• Lemma: $V(n, m) = \max_{0 \le k \le m} \left[ V(n/2, k) + V^r(n/2, m - k) \right]$

• Q: What does this lemma say?

• A: The alignment value V(n, m) is the sum of the values of the two smaller alignment problems V(n/2, k) and V^r(n/2, m - k), where k is chosen to yield the largest sum.

[Figure: n × m table with row n/2 and a split column k marked.]
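The lemma can be checked numerically on a small example. The sketch below is illustrative only: the helper names `last_row` and `split_value` and the scoring values (match +1, mismatch -1, space -1) are assumptions, and the split row is h = n // 2 so that odd n also works (the reversed half then has n - h rows).

```python
def last_row(s1, s2, match=1, mismatch=-1, space=-1):
    """Last row of the global similarity table for s1 vs s2, in O(len(s2)) space."""
    prev = [j * space for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * space] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair, prev[j] + space, cur[j - 1] + space)
        prev = cur
    return prev


def split_value(s1, s2):
    """Return (k*, max over k of V(h, k) + V^r(n - h, m - k)) for h = n // 2."""
    h, m = len(s1) // 2, len(s2)
    forward = last_row(s1[:h], s2)               # V(h, k) for every k
    backward = last_row(s1[h:][::-1], s2[::-1])  # V^r(n - h, j) for every j
    k_star = max(range(m + 1), key=lambda k: forward[k] + backward[m - k])
    return k_star, forward[k_star] + backward[m - k_star]


s1, s2 = "ACGTTGCA", "AGTTCA"
k_star, value = split_value(s1, s2)
assert value == last_row(s1, s2)[-1]  # both sides of the lemma agree
```

Both rows are computed with the two-row method, so finding the split column this way needs only O(m) space.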





Defn: Let k* be the position k that maximizes V(n/2, k) + V^r(n/2, m - k).

Defn: Let L denote the solution path from cell (0, 0) to cell (n, m).

[Figure: solution path L through the n × m table, crossing row n/2 at column k*.]




Linear Space Alignments

Defn: Let L_{n/2} be the subpath of L that starts with the last node of L in row n/2 - 1 and ends with the first node of L in row n/2 + 1.

[Figure: path L through the n × m table, with the subpath L_{n/2} between rows n/2 - 1 and n/2 + 1 around column k*.]




Linear Space Alignments

Lemma:

1. Position k* can be found in row n/2 in time O(nm) and space O(m).

2. The subpath L_{n/2} can be found and stored within the same bounds.

Proof sketch:

1. Process the first n/2 rows of the S1 and S2 alignment table, saving row n/2 with its traceback pointers.

2. Process the first n/2 rows of the S1^r and S2^r alignment table, saving row n/2 with its traceback pointers.




Linear Space Alignments

Proof sketch continued:

3. For each k, add V(n/2, k) and V^r(n/2, m - k).

4. Set k* to be the k that maximizes V(n/2, k) + V^r(n/2, m - k).

Steps 1 & 2 take O(nm) time and O(m) space.

Steps 3 & 4 take O(m) time.

5. One set of traceback pointers leads from k* to k1 in row n/2 - 1.

6. The other set leads from k* to k2 in row n/2 + 1.

Steps 5 & 6 give the subpath L_{n/2}.




Linear Space Alignments

Summary:

In O(nm) time and O(m) space we have:

1. Found the value V(n, m).

2. Found k*, k1, and k2.

3. Found the subpath L_{n/2}.

4. Created two subproblems:

– Aligning S1[1..n/2-1] with S2[1..k1]

– Aligning S1[n/2+1..n] with S2[k2..m]




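[Figure: n × m table showing path L and the subpath L_{n/2}, which runs from column k1 in row n/2 - 1 through column k* in row n/2 to column k2 in row n/2 + 1, with the subproblem rectangles A (top) and B (bottom).]

Aligning S1[1..n/2-1] with S2[1..k1] is the top problem, labeled A.

Aligning S1[n/2+1..n] with S2[k2..m] is the bottom problem, labeled B.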




Linear Space Alignments

Q: What is the dynamic programming time for a p by q table?

A: cpq, where c is some constant.

Q: How long does it take to determine row n/2 of an n by m table?

A: cnm/2. Thus it takes cnm time in total to process the two middle rows (for V and V^r).

We can solve problems A and B in time proportional to their total size.

The middle row of A can be determined in time c·k*·n/2. The middle row of B can be determined in time c·(m - k*)·n/2. Altogether this is cnm/2.




Linear Space Alignments

Q: How are we going to find the optimal alignment in linear space?

A: Use recursion!

[Figure: the n × m table split at row n/2 into subproblems A and B; A is split again at row n/4 into AA and AB, and B at row 3n/4 into BA and BB, each with its own k*, k1, k2, and middle subpath.]




Linear Space Alignments

Hirschberg’s Algorithm

Procedure OPTA(l, l´, r, r´) {
    h = (l + l´)/2; /* midpoint row of the first substring */
    Find k*, k1, k2, & L_h in space O(r´ - r) = O(m);
    OPTA(l, h-1, r, k1); /* new top problem */
    output subpath L_h;
    OPTA(h+1, l´, k2, r´); /* new bottom problem */
}

The first call is: OPTA(1, n, 1, m)
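A runnable sketch of the recursion in Python is given below. It is not a transcription of OPTA: it uses the common simplification of splitting directly at cell (h, k*) and recursing on the two halves, instead of tracking k1, k2, and the subpath L_h. The function names and the scoring constants (match +1, mismatch -1, space -1) are assumptions for illustration, and the string slices copy data, so the sketch trades strict linear space for brevity.

```python
MATCH, MISMATCH, SPACE = 1, -1, -1  # assumed scores, not from the slides


def last_row(a, b):
    """Last row of the global similarity table for a vs b (two-row method)."""
    prev = [j * SPACE for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * SPACE] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur[j] = max(prev[j - 1] + pair, prev[j] + SPACE, cur[j - 1] + SPACE)
        prev = cur
    return prev


def align_single(ch, b):
    """Align one character with non-empty b: use a match if any, else column 0.

    Valid for the assumed scores, where a mismatch beats two spaces.
    """
    j = max(b.find(ch), 0)
    return "-" * j + ch + "-" * (len(b) - j - 1), b


def hirschberg(a, b):
    """Return an optimal global alignment of a and b as two gapped strings."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        return align_single(a, b)
    h, m = len(a) // 2, len(b)
    forward = last_row(a[:h], b)               # V(h, k)
    backward = last_row(a[h:][::-1], b[::-1])  # V^r(n - h, m - k)
    k_star = max(range(m + 1), key=lambda k: forward[k] + backward[m - k])
    top = hirschberg(a[:h], b[:k_star])        # top problem
    bottom = hirschberg(a[h:], b[k_star:])     # bottom problem
    return top[0] + bottom[0], top[1] + bottom[1]


def alignment_score(x, y):
    """Score two gapped strings column by column."""
    return sum(SPACE if "-" in (p, q) else (MATCH if p == q else MISMATCH)
               for p, q in zip(x, y))


x, y = hirschberg("ACGTTGCA", "AGTTCA")
assert alignment_score(x, y) == last_row("ACGTTGCA", "AGTTCA")[-1]
```

The final assertion compares the score of the recovered alignment with V(n, m) computed directly, which is a convenient sanity check when experimenting with other inputs.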




Linear Space Alignments

Analysis:

• The first call uses cnm time.

• The second level uses cnm/2 time in total for its 2 subproblems.

• The ith level of recursion entails 2^(i-1) subproblems.

• There are n/2^(i-1) rows in each of the level i problems.

• The total time at level i is cnm/2^(i-1).

Thm. Hirschberg’s optimal alignment algorithm takes time $\sum_{i=1}^{1 + \log n} cnm/2^{i-1} \le 2cnm$ and space O(m).




Linear Space Alignments

Q: What about computing local alignment?

Recall:

– This is solved by finding the cell (i*, j*) with maximum value v.

– i* and j* are the end indices of substrings α and β.

We can compute v row-wise using only linear space (recall from Chapter 11 that v(i, j) is the value of the optimal local suffix alignment).

Q: How do we find the start indices of α and β without the full table?

A: The author suggests reverse dynamic programming.

Huh? ‘Reverse the polarity’? Where is Dr. Who?




Linear Space Alignments

Finding the start indices of α and β:

Extend the algorithm for v to set a pointer h(i, j) for each cell (i, j):

– If v(i, j) = 0, then set h(i, j) to (i, j).

– If v(i, j) > 0 and the normal traceback pointer would be to cell (p, q), then set h(i, j) to h(p, q).

Consequently, h(i*, j*) specifies the starting cell, i.e., the starting positions of α and β.

Finding α and β can be done in linear space. Aligning α with β by Hirschberg’s method then gives the local alignment in O(nm) time & O(m) space.
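The pointer-propagation idea can be sketched in Python with two rows of values and two rows of h pointers. This is a minimal sketch under stated assumptions: the function name `local_alignment_ends`, the scoring values (match +2, mismatch -1, space -1), and the tie-breaking order (diagonal, then up, then left) are all illustrative choices, not taken from the slides.

```python
def local_alignment_ends(s1, s2, match=2, mismatch=-1, space=-1):
    """Return (best value, start cell, end cell) of an optimal local alignment.

    Only two rows of v and two rows of h are kept, so space is O(len(s2)).
    A start cell (p, q) and end cell (i, j) correspond to the substrings
    s1[p:i] and s2[q:j] in Python slicing.
    """
    m = len(s2)
    prev_v = [0] * (m + 1)
    prev_h = [(0, j) for j in range(m + 1)]
    best = (0, (0, 0), (0, 0))
    for i in range(1, len(s1) + 1):
        cur_v = [0] * (m + 1)
        cur_h = [(i, 0)] + [None] * m
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            candidates = [
                (prev_v[j - 1] + pair, prev_h[j - 1]),  # diagonal
                (prev_v[j] + space, prev_h[j]),         # up
                (cur_v[j - 1] + space, cur_h[j - 1]),   # left
            ]
            value, start = max(candidates, key=lambda c: c[0])
            if value <= 0:
                value, start = 0, (i, j)  # v(i, j) = 0: h points to itself
            cur_v[j], cur_h[j] = value, start
            if value > best[0]:
                best = (value, start, (i, j))
        prev_v, prev_h = cur_v, cur_h
    return best


value, (p, q), (i, j) = local_alignment_ends("xxxACGTyyy", "zzACGTzz")
print(value, "xxxACGTyyy"[p:i], "zzACGTzz"[q:j])  # 8 ACGT ACGT
```

Each cell inherits h from whichever neighbor its traceback pointer would select, so the start cell is available as soon as the maximum is found and no second pass over a full table is needed.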




Bounded Differences

Imagine problems in which there is a bound on the number of expected differences.

Q: Can we solve the alignment problem faster than O(nm)?

A: Yes. If the alignment contains at most k differences, O(km) is possible.

Core Idea: The main diagonal comprises cells (i, i), i ≤ n ≤ m. No k-difference alignment can stray into cells (i, i + l) or (i, i - l), l > k.




Bounded Differences

[Figure: example with k = 3, showing the main diagonal (i, i) and the bounds (i, i - l), (i, i + l), l > k.]




Bounded Differences

Recall: a solution path must extend from cell (0, 0) to cell (n, m), which lies on or to the right of the main diagonal.

Observation: k ≥ m - n is required for a k-difference solution to exist.

[Figure: example with k = 3, showing the main diagonal (i, i) and the bounds (i, i - l), (i, i + l), l > k.]




Bounded Differences

Q: How can we achieve time complexity O(km) in a table with O(nm) entries?

A: Only fill the O(km) cells, out of the O(nm) cells in the table, that straddle the main diagonal.

[Figure: example with k = 3, showing the main diagonal (i, i) and the filled cells (i, i - l), (i, i + l), l ≤ k.]




Bounded Differences

Algorithm: Fill in the table in strips 2k+1 cells wide centered on the main diagonal.

Cell (2,5) is computed from cells (1,4) & (2,4) but not cell (1,5).

Cell (5,3) is computed from cells (4,2) & (4,3) but not cell (5,2).

[Figure: strip around the main diagonal with cells (1,4), (2,4), (1,5), (2,5) and (4,2), (4,3), (5,2), (5,3) labeled.]

Note: The recurrence requires the three neighboring cells.

Q: How do we handle neighbors in the forbidden zone?

A: Ignore them.




Bounded Differences

Thm. There is a global alignment of S1 and S2 with at most k differences if and only if the algorithm from the previous slide assigns cell (n, m) a value of k or less.

The k-difference global alignment problem can be solved in time O(km) and space O(km).
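The strip computation can be sketched in Python as below. This is a minimal sketch that returns only the value in cell (n, m), not the alignment, so it keeps two band rows rather than the O(km) storage the slide mentions. The function name `k_difference`, the unit edit costs, and the use of dictionaries to hold the band are assumptions for illustration.

```python
def k_difference(s1, s2, k):
    """Edit distance of s1 and s2 if it is at most k, otherwise None.

    Only cells (i, j) with |i - j| <= k are filled; neighbors outside the
    strip are ignored by treating them as infinitely expensive.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None  # cell (n, m) lies outside the strip
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}  # row 0 inside the strip
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # column 0: i deletions
                continue
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev.get(j - 1, inf) + cost,  # match or mismatch
                         prev.get(j, inf) + 1,         # space in s2
                         cur.get(j - 1, inf) + 1)      # space in s1
        prev = cur
    return prev[m] if prev[m] <= k else None


print(k_difference("kitten", "sitting", 3))  # 3
print(k_difference("kitten", "sitting", 2))  # None
```

Each row touches at most 2k + 1 cells, which is where the O(km) time bound comes from.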




Bounded Differences

Q: What if we don’t know the value of k?

Q: How can we decide on a k value?

Soln. Start with k = 1.

If no solution is found, let k = 2 * k.

Repeat the doubling of k until a solution is found.

• We double k to find the optimal value k*.

• We stop doubling k when a solution is found.

• k* is the number of differences in the best alignment found with the current value of k.

• Since we have been doubling k, k* ≤ k.




Bounded Differences

Thm. Doubling k, starting from 1, yields the edit distance k* and a corresponding alignment in O(k*m) time and space.

Proof: Let k_L be the largest value of k used for a given pair of strings. Then k_L ≤ 2k*. The effort involved is O(k_L m + k_L m/2 + k_L m/4 + ... + m) = O(k_L m). But O(k_L m) = O(k*m).

Q: Why do we state k_L ≤ 2k* instead of k_L < 2k*?
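The doubling strategy can be sketched in Python as a loop around a banded k-difference routine. The names `banded_distance` and `edit_distance_by_doubling` are hypothetical, unit edit costs are assumed, and the compact banded helper is included only so that the block runs on its own.

```python
def banded_distance(s1, s2, k):
    """Edit distance if it is at most k, else None (cells with |i - j| <= k only)."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
            else:
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                cur[j] = min(prev.get(j - 1, inf) + cost,
                             prev.get(j, inf) + 1,
                             cur.get(j - 1, inf) + 1)
        prev = cur
    return prev[m] if prev[m] <= k else None


def edit_distance_by_doubling(s1, s2):
    """Return (k*, k_L): the edit distance and the last strip bound tried."""
    k = 1
    while True:
        d = banded_distance(s1, s2, k)
        if d is not None:
            return d, k  # first k for which a solution exists
        k *= 2           # no solution yet: double the bound


print(edit_distance_by_doubling("kitten", "sitting"))  # (3, 4)
```

For this pair the bounds tried are 1, 2, and 4, so k_L = 4 while k* = 3, consistent with k_L ≤ 2k*.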




Homework

Due 2/27/03

Part 1

#24. Show how to solve the alphabet-weight alignment problem with affine weights in O(nm) time.

#27. The recurrence relations we developed for the affine gap model follow the logic of paying Wg + Ws when a gap is initiated and then paying Ws for each additional space used in that gap. An alternative logic is to pay Wg + Ws at the point when the gap is “completed”. Write recurrence relations for the affine gap model that follow that logic. The recurrences should compute the alignment in O(nm) time.

Continued on next page.




Homework

Part 2

#1. Show how to compute the value V(n,m) of the optimal alignment using only min(n,m) + 1 space in addition to the space needed to represent the two input strings.

#4. Show how to reduce the size of the strip needed in the method of Section 12.2.3, when |m - n| < k.

Continued on next page.




Homework

Part 2 continued

Grad students only:

#5. Fill in the details of how to find the actual alignments of P in T that occur with at most k differences. The method uses the O(km) values stored during the k differences algorithm. The solution is somewhat simpler if the k differences algorithm also stores a sparse set of pointers recording how each farthest-reaching d-path extends a farthest-reaching (d-1)-path. These pointers only take O(km) space and are a sparse version of the standard dynamic programming pointers. Fill in the details of this approach as well.

Required portion of question.

Optional, extra credit portion of question.
