Bioinformatics Algorithms and Data Structures

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String Edits and Alignments Lecturer: Dr. Rose Slides by: Dr. Rose February 13, 2003


Page 1: Bioinformatics Algorithms and Data Structures

UNIVERSITY OF SOUTH CAROLINA

College of Engineering & Information Technology

Bioinformatics Algorithms and Data Structures

Chapter 12: Refining Core String Edits and Alignments

Lecturer: Dr. Rose

Slides by: Dr. Rose

February 13, 2003

Page 2: Bioinformatics Algorithms and Data Structures

Homework: due 2/20/03

Chapter 11 questions:

• #1

• #4

• #7

• #8

Additional question for grad students

• #10

Page 3: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

• Dynamic programming takes Θ(nm) space for alignments.

• Can alignments be computed in linear space?

• Hirschberg’s method

– Good news: reduces space from Θ(nm) to O(n), where n < m.

– Bad news: doubles the worst-case time bound.

Page 4: Bioinformatics Algorithms and Data Structures

Linear Space for Similarity

• Recall: similarity is expressed as a single scalar.

• There is an alignment that corresponds to this scalar.

– i.e., the optimal alignment whose value is this scalar.

• We’ve needed the O(nm) table for the alignment.

• Q: If we only want the similarity value, do we need the table?

• A: No. We only need the space required to compute the value.

Page 5: Bioinformatics Algorithms and Data Structures

Linear Space for Similarity

• Q: How much space is that?

• A: Two rows.

– Recall: to compute cell (i, j) we need cells (i − 1, j − 1), (i − 1, j), and (i, j − 1).

– Cells (i − 1, j − 1) and (i − 1, j) are on the previous row.

– Cells (i, j − 1) and (i, j) are on the current row.

– We only need the current row, C, and the previous row, P.

– When the current row is done, copy it to the previous row for the next iteration, i.e., C → P.

Page 6: Bioinformatics Algorithms and Data Structures

Linear Space for Similarity

– After n iterations row C holds the last row n of the full table.

– The similarity value V(n,m) is in the last cell of C.

– The time complexity is still O(nm) but space is now O(m).
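
The two-row scheme can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `similarity` and the scoring values (match +1, mismatch −1, space −1) are assumptions chosen only to make the example concrete.

```python
def similarity(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return V(n, m) while storing only two rows of the table.

    The scoring values are illustrative assumptions; the slides do not
    fix a particular scoring scheme.
    """
    m = len(s2)
    prev = [j * gap for j in range(m + 1)]      # row P: V(0, j)
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * m               # row C: V(i, 0) first
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair,    # cell (i-1, j-1)
                         prev[j] + gap,         # cell (i-1, j)
                         cur[j - 1] + gap)      # cell (i, j-1)
        prev = cur                              # C -> P for the next row
    return prev[m]                              # last cell of the last row


print(similarity("ACGT", "AGT"))
```

Only `prev` and `cur` are alive at any time, so the working space is proportional to m while the running time stays proportional to nm.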

Page 7: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

• Q: How can we find the actual alignment in linear space?

• Consider an alignment solution path in the table computation.

[Figure: dynamic programming table with n rows and m columns.]

Page 8: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

• Imagine that we knew that the optimal alignment solution path went through cell (n/2, k*).

[Figure: n × m table with row n/2 and column k* marked.]

• Knowing this, we could solve the problem by piecing together solution paths for the diagonal quadrants (upper left and lower right).

Page 9: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

[Figure: n × m table with row n/2 and column k* marked.]

• Just as important, we could ignore the antidiagonal quadrants (upper right and lower left).

• We could repeat this process, reducing the amount of space needed to find the optimal alignment.

Page 10: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

[Figure: n × m table with rows n/4, n/2, and 3n/4 and column k* marked.]

• Repeating this process would reduce the amount of space needed to find the optimal alignment.

• Q: How far can we go?

Page 11: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

• Q: How can we find the cell (n/2, k*)?

• A: Stay tuned!

• Defn. Let α^r denote the reverse of string α.

• Defn. V^r(i, j) is the similarity of the first i characters of S1^r with the first j characters of S2^r.

[Figure: n × m table for S1^r and S2^r with row i and column j marked.]

Page 12: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

• An equivalent formulation: V^r(i, j) is the similarity of the last i characters of S1 with the last j characters of S2.

[Figure: n × m table for S1 and S2 with row n − i and column m − j marked.]

• V^r(i, j) can be computed in O(nm) time and O(m) space in the same way as V(i, j), by running the row-by-row computation on the reversed strings.

• Furthermore, any single row of V^r can be computed in O(m) space.

Page 13: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

• Lemma: $V(n, m) = \max_{0 \le k \le m} [V(n/2, k) + V^r(n/2, m-k)]$

• Q: What does this lemma say?

• A: The alignment value V(n, m) is the sum of the values of two smaller alignment problems, V(n/2, k) and V^r(n/2, m − k), where k is chosen to yield the largest sum.

[Figure: n × m table with row n/2 and column k marked.]
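
The lemma can be checked numerically with a small sketch. The helper names `last_row` and `find_k_star` and the scoring values are assumptions made for this illustration. Where n is odd, the sketch gives the forward pass the first ⌊n/2⌋ characters of S1 and the reverse pass the remaining ones, a detail the slides gloss over by writing n/2 for both.

```python
def last_row(s1, s2, match=1, mismatch=-1, gap=-1):
    """Last row of the similarity table for s1 vs s2, using two rows."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap]
        for j in range(1, len(s2) + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def find_k_star(s1, s2):
    """Return (k*, best sum) for the middle row of the table."""
    m = len(s2)
    half = len(s1) // 2
    fwd = last_row(s1[:half], s2)                  # V(n/2, k) for k = 0..m
    rev = last_row(s1[half:][::-1], s2[::-1])      # V^r over the remaining rows
    sums = [fwd[k] + rev[m - k] for k in range(m + 1)]
    k_star = max(range(m + 1), key=sums.__getitem__)
    return k_star, sums[k_star]


k_star, value = find_k_star("GATTACA", "GCATGCT")
# By the lemma, the best sum equals the full similarity V(n, m).
print(k_star, value, last_row("GATTACA", "GCATGCT")[-1])
```

Both passes keep only two rows, so k* is found in O(nm) time and O(m) space.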

Page 14: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Defn: Let k* be a position k that maximizes V(n/2, k) + V^r(n/2, m − k).

Defn: Let L denote the solution path from cell (0, 0) to cell (n, m).

[Figure: n × m table with the path L passing through cell (n/2, k*).]

Page 15: Bioinformatics Algorithms and Data Structures

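Linear Space Alignments

Defn: Let L_{n/2} be the subpath of L that starts with the last node of L in row n/2 − 1 and ends with the first node of L in row n/2 + 1.

[Figure: n × m table showing the path L and the subpath L_{n/2}, which runs from row n/2 − 1 through cell (n/2, k*) to row n/2 + 1.]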

Page 16: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Lemma:

1. Position k* in row n/2 can be found in O(nm) time and O(m) space.

2. The subpath L_{n/2} can be found and stored within the same bounds.

Proof sketch:

1. Process the first n/2 rows of the S1 & S2 alignment, saving row n/2 with its traceback pointers.

2. Process the first n/2 rows of the S1^r & S2^r alignment, saving row n/2 with its traceback pointers.

Page 17: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Proof sketch continued:

3. For each k, add V(n/2, k) and V^r(n/2, m − k).

4. Set k* to be the k that maximizes V(n/2, k) + V^r(n/2, m − k).

Steps 1 & 2 take O(nm) time and O(m) space.

Steps 3 & 4 take O(m) time.

5. One set of traceback pointers leads from k* to k1 in row n/2 − 1.

6. The other leads from k* to k2 in row n/2 + 1.

Steps 5 & 6 give the subpath L_{n/2}.

Page 18: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Summary:

In O(nm) time and O(m) space we have:

1. Found the value V(n, m).

2. Found k*, k1, and k2.

3. Found the subpath L_{n/2}.

4. Created two subproblems:

– Aligning S1[1..n/2 − 1] with S2[1..k1]

– Aligning S1[n/2 + 1..n] with S2[k2..m]

Page 19: Bioinformatics Algorithms and Data Structures

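Linear Space Alignments

[Figure: n × m table showing the path L, the subpath L_{n/2} through cell (n/2, k*), columns k1 and k2, rows n/2 − 1 and n/2 + 1, and the subproblem rectangles A (top) and B (bottom).]

Aligning S1[1..n/2 − 1] with S2[1..k1] is the top problem, labeled A.

Aligning S1[n/2 + 1..n] with S2[k2..m] is the bottom problem, labeled B.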

Page 20: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Q: What is the dynamic programming time for a p by q table?

A: cpq, where c is some constant.

Q: How long does it take to determine row n/2 of an n by m table?

A: cnm/2. Thus it takes cnm time to process the two middle rows (for V and V^r).

We can solve problems A and B in time proportional to their total size.

The middle row of A can be determined in time c k* n/2.

The middle row of B can be determined in time c (m − k*) n/2.

Altogether this is cnm/2.

Page 21: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Q: How are we going to find the optimal alignment in linear space?

A: Use recursion!

[Figure: the n × m table split into subproblems A and B around L_{n/2}; A is split in turn at row n/4 into AA and AB, and B at row 3n/4 into BA and BB, each with its own k*, k1, and k2.]

Page 22: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Hirschberg’s Algorithm

Procedure OPTA(l, l´, r, r´) {
    h = l + (l´ - l)/2; /* midpoint row of the first substring */
    Find k*, k1, k2, & L_h in space O(r´ - r) = O(m);
    OPTA(l, h-1, r, k1); /* new top problem */
    output subpath L_h;
    OPTA(h+1, l´, k2, r´); /* new bottom problem */
}

The first call is: OPTA(1, n, 1, m)
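
The recursion can be sketched in Python as below. This is a hedged illustration rather than a transcription of OPTA: it uses the common simplification of splitting exactly at cell (n/2, k*) instead of tracking k1, k2, and the subpath L_h, it adds explicit base cases that the procedure above leaves implicit, and the names (`hirschberg`, `last_row`, `alignment_score`) and scores are assumptions. It slices strings for brevity, so a strict O(m) space bound would need an index-based version. Input strings are assumed not to contain "-".

```python
MATCH, MISMATCH, GAP = 1, -1, -1   # assumed scores, not from the slides


def pair_score(a, b):
    return MATCH if a == b else MISMATCH


def last_row(s1, s2):
    """Last row of the similarity table, computed with two rows."""
    prev = [j * GAP for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * GAP]
        for j in range(1, len(s2) + 1):
            cur.append(max(prev[j - 1] + pair_score(s1[i - 1], s2[j - 1]),
                           prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev


def hirschberg(s1, s2):
    """Return an optimal global alignment as two gapped strings."""
    n, m = len(s1), len(s2)
    if n == 0:
        return "-" * m, s2
    if m == 0:
        return s1, "-" * n
    if n == 1:
        # Put the single character opposite its best partner, or opposite a gap.
        j = max(range(m), key=lambda c: pair_score(s1, s2[c]))
        if pair_score(s1, s2[j]) >= 2 * GAP:
            return "-" * j + s1 + "-" * (m - j - 1), s2
        return s1 + "-" * m, "-" + s2
    h = n // 2                                      # middle row
    fwd = last_row(s1[:h], s2)                      # V(h, k)
    rev = last_row(s1[h:][::-1], s2[::-1])          # V^r for the rest
    k = max(range(m + 1), key=lambda c: fwd[c] + rev[m - c])   # k*
    top = hirschberg(s1[:h], s2[:k])                # top problem
    bottom = hirschberg(s1[h:], s2[k:])             # bottom problem
    return top[0] + bottom[0], top[1] + bottom[1]


def alignment_score(a, b):
    return sum(GAP if "-" in (x, y) else pair_score(x, y) for x, y in zip(a, b))


a, b = hirschberg("GATTACA", "GCATGCT")
print(a)
print(b)
print(alignment_score(a, b) == last_row("GATTACA", "GCATGCT")[-1])
```

The final line compares the score of the recovered alignment with the value V(n, m) computed directly, which is a convenient sanity check on any implementation of the method.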

Page 23: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Analysis:

• The first call uses cnm time.

• The second level uses cnm/2 time in total for its 2 subproblems.

• The ith level of recursion entails $2^{i-1}$ subproblems.

• There are $n/2^{i-1}$ rows in each of the level-i problems.

• The total time at level i is $cnm/2^{i-1}$.

Thm. Hirschberg’s optimal alignment algorithm takes time $\sum_{i=1}^{1+\log n} cnm/2^{i-1} \le 2cnm$ and space O(m).
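
The bound in the theorem follows from a geometric series. The short derivation below is a worked step added for clarity, assuming logarithms base 2 and reindexing with j = i − 1:

```latex
\sum_{i=1}^{1+\log_2 n} \frac{cnm}{2^{i-1}}
  = cnm \sum_{j=0}^{\log_2 n} \frac{1}{2^{j}}
  \le cnm \sum_{j=0}^{\infty} \frac{1}{2^{j}}
  = 2cnm
```

This is the sense in which the method doubles the worst-case time bound: the top level costs cnm, and all deeper levels together cost at most another cnm.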

Page 24: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Q: What about computing local alignment?

Recall:

– This is solved by finding the cell (i*, j*) with maximum value v.

– i* and j* are the end indices of substrings α and β.

We can compute v row-wise using only linear space (recall that v(i, j) is the optimal suffix alignment value from Chapter 11).

Q: How do we find the start indices of α and β without the full table?

A: The author suggests reverse dynamic programming.

Huh? ‘Reverse the polarity’? Where is Dr. Who?

Page 25: Bioinformatics Algorithms and Data Structures

Linear Space Alignments

Finding the start indices of α and β:

Extend the algorithm for v to set a pointer h(i, j) for each cell (i, j):

– If v(i, j) = 0, then set h(i, j) to (i, j).

– If v(i, j) > 0 and the normal traceback pointer would be to cell (p, q), then set h(i, j) to h(p, q).

Consequently, h(i*, j*) specifies the starting cell, i.e., the starting positions of α and β.

Finding α and β can be done in linear space.

Local alignment can be done in O(nm) time and O(m) space.
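
The pointer-propagation rule can be sketched as follows. The function name `local_alignment_span` and the scoring values are assumptions for illustration. The sketch carries h(i, j) along with v(i, j) in two rows and reports the best value together with the start and end cells.

```python
def local_alignment_span(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (best value, start cell, end cell) of an optimal local alignment.

    Only two rows of values v and of start pointers h are stored.
    The scoring values are illustrative assumptions.
    """
    m = len(s2)
    prev_v = [0] * (m + 1)
    prev_h = [(0, j) for j in range(m + 1)]
    best, best_start, best_end = 0, (0, 0), (0, 0)
    for i in range(1, len(s1) + 1):
        cur_v = [0] * (m + 1)
        cur_h = [(i, 0)] + [None] * m
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            candidates = [
                (prev_v[j - 1] + pair, prev_h[j - 1]),   # traceback to (i-1, j-1)
                (prev_v[j] + gap, prev_h[j]),            # traceback to (i-1, j)
                (cur_v[j - 1] + gap, cur_h[j - 1]),      # traceback to (i, j-1)
            ]
            v, h = max(candidates, key=lambda c: c[0])
            if v <= 0:
                v, h = 0, (i, j)       # v(i, j) = 0: the cell points to itself
            cur_v[j], cur_h[j] = v, h  # otherwise inherit h(p, q)
            if v > best:
                best, best_start, best_end = v, h, (i, j)
        prev_v, prev_h = cur_v, cur_h
    return best, best_start, best_end


s1, s2 = "XXACGTXX", "YYACGTYY"
best, (i0, j0), (i1, j1) = local_alignment_span(s1, s2)
print(best, s1[i0:i1], s2[j0:j1])
```

Because the start cell is the zero-valued cell just before the alignment begins, the substrings are `s1[i0:i1]` and `s2[j0:j1]` in Python slice notation. Once the start and end cells are known, the alignment itself can be recovered by running a linear-space global alignment on those two substrings.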

Page 26: Bioinformatics Algorithms and Data Structures

Bounded Differences

Imagine problems in which there is a bound on the number of expected differences.

Q: Can we solve the alignment faster than O(nm)?

A: Yes, if the alignment contains at most k differences, O(km) is possible.

Core Idea: The main diagonal consists of cells (i, i), i ≤ n ≤ m. A k-difference alignment cannot stray into cells (i, i + l) or (i, i − l), l > k.

Page 27: Bioinformatics Algorithms and Data Structures

Bounded Differences

Core Idea: The main diagonal consists of cells (i, i), i ≤ n ≤ m. A k-difference alignment cannot stray into cells (i, i + l) or (i, i − l), l > k.

Example: k = 3; (i, i) is the main diagonal; cells (i, i − l) and (i, i + l) with l > k are out of bounds.

Page 28: Bioinformatics Algorithms and Data Structures

Bounded Differences

Recall: a solution path must extend from cell (0, 0) and end in cell (n, m), which lies on or to the right of the main diagonal.

Observation: k ≥ m − n is required for a k-difference solution to exist.

Example: k = 3; (i, i) is the main diagonal; cells (i, i − l) and (i, i + l) with l > k are out of bounds.

Page 29: Bioinformatics Algorithms and Data Structures

Bounded Differences

Q: How can we achieve time complexity O(km) in a table with O(nm) entries?

A: Only fill the O(km) cells, out of the O(nm) total, that straddle the main diagonal.

Example: k = 3; (i, i) is the main diagonal; cells (i, i − l) and (i, i + l) with l ≤ k are filled.

Page 30: Bioinformatics Algorithms and Data Structures

Bounded Differences

Algorithm: Fill in the table in strips 2k+1 cells wide centered on the main diagonal.

Cell (2,5) is computed from cells (1,4) & (2,4) but not cell (1,5).

Cell (5,3) is computed from cells (4,2) & (4,3) but not cell (5,2).

[Figure: strip around the main diagonal with cells (1,4), (2,4), (1,5), (2,5) and (4,2), (4,3), (5,2), (5,3) labeled.]

Note: The recurrence requires the three neighboring cells.

Q: How do we handle neighbors in the forbidden zone?

A: Ignore them.

Page 31: Bioinformatics Algorithms and Data Structures

Bounded Differences

Thm. There is a global alignment of S1 and S2 with at most k differences IFF the algorithm from the previous slide assigns cell (n,m) the value k or less.

The k-difference global alignment problem can be solved in time O(km) and space O(km).
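
A minimal sketch of the strip computation, assuming unit-cost edit distance (each mismatch or space counts as one difference) and a hypothetical function name `banded_edit_distance`. It keeps only two rows of the strip, so it returns the value at cell (n, m) but not the alignment; recovering the alignment is what needs the O(km) space quoted above.

```python
def banded_edit_distance(s1, s2, k):
    """Edit distance of s1 and s2 if it is at most k, otherwise None.

    Only cells (i, j) with |i - j| <= k are filled; neighbors outside
    the strip are ignored by treating them as infinitely expensive.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None                 # cell (n, m) lies outside the strip
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}          # row 0 of the strip
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i          # first column: i deletions
                continue
            diag = prev.get(j - 1, inf) + (s1[i - 1] != s2[j - 1])
            up = prev.get(j, inf) + 1
            left = cur.get(j - 1, inf) + 1
            cur[j] = min(diag, up, left)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None


print(banded_edit_distance("kitten", "sitting", 3))
print(banded_edit_distance("kitten", "sitting", 2))
```

Each row touches at most 2k + 1 cells, which gives the O(km) running time.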

Page 32: Bioinformatics Algorithms and Data Structures

Bounded Differences

Q: What if we don’t know the value of k?

Q: How can we decide on a k value?

Soln. Start with k = 1.

If no solution is found, let k = 2 * k.

Repeat the doubling of k until a solution is found.

• We double k to find the optimal value k*.

• We stop doubling k when a solution is found.

• k* is the number of differences in the best alignment found with the current value of k.

• Since we have been doubling k, k* ≤ k.
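
The doubling strategy can be sketched as below, again assuming unit-cost edit distance. The names `banded` and `edit_distance_by_doubling` are hypothetical; `banded` is a compact strip computation included only to keep the example self-contained.

```python
def banded(s1, s2, k):
    """Edit distance if it is at most k, else None (strip of width 2k + 1)."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
            else:
                cur[j] = min(prev.get(j - 1, inf) + (s1[i - 1] != s2[j - 1]),
                             prev.get(j, inf) + 1,
                             cur.get(j - 1, inf) + 1)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None


def edit_distance_by_doubling(s1, s2):
    """Return (k*, last k tried), doubling k from 1 until a solution exists."""
    k = 1
    while True:
        d = banded(s1, s2, k)
        if d is not None:
            return d, k
        k *= 2                      # no k-difference alignment: double k


print(edit_distance_by_doubling("kitten", "sitting"))
```

The loop always terminates, because once k reaches max(n, m) the strip covers the whole table and the edit distance cannot exceed that value.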

Page 33: Bioinformatics Algorithms and Data Structures

Bounded Differences

Thm. Doubling k, starting from 1, finds the edit distance k* and a corresponding alignment in O(k* m) time and space.

Proof: Let k_L be the largest value of k used for a given pair of strings. Then k_L ≤ 2k*. The effort involved is O(k_L m + k_L m/2 + k_L m/4 + ... + m) = O(k_L m). But O(k_L m) = O(k* m).

Q: Why do we state k_L ≤ 2k* instead of k_L < 2k*?

Page 34: Bioinformatics Algorithms and Data Structures

Homework

Due 2/27/03

Part 1

#24. Show how to solve the alphabet-weight alignment problem with affine weights in O(nm) time.

#27. The recurrence relations we developed for the affine gap model follow the logic of paying Wg + Ws when a gap is initiated and then paying Ws for each additional space used in that gap. An alternative logic is to pay Wg + Ws at the point when the gap is “completed”. Write recurrence relations for the affine gap model that follow that logic. The recurrences should compute the alignment in O(nm) time.

Continued on next page.

Page 35: Bioinformatics Algorithms and Data Structures

Homework

Part 2

#1. Show how to compute the value V(n, m) of the optimal alignment using only min(n, m) + 1 space in addition to the space needed to represent the two input strings.

#4. Show how to reduce the size of the strip needed in the method of Section 12.2.3, when |m - n| < k.

Continued on next page.

Page 36: Bioinformatics Algorithms and Data Structures

Homework

Part 2 continued

Grad students only:

#5. Fill in the details of how to find the actual alignments of P in T that occur with at most k differences. The method uses the O(km) values stored during the k differences algorithm. The solution is somewhat simpler if the k differences algorithm also stores a sparse set of pointers recording how each farthest-reaching d-path extends a farthest-reaching (d-1)-path. These pointers only take O(km) space and are a sparse version of the standard dynamic programming pointers. Fill in the details of this approach as well.

Required portion of question.

Optional, extra credit portion of question.