Bioinformatics Algorithms and Data Structures

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits Lecturer: Dr. Rose Slides by: Dr. Rose January 30 & February 1, 2007


Page 1: Bioinformatics Algorithms and Data Structures


Bioinformatics Algorithms and Data Structures

Chapter 11: Core String Edits

Lecturer: Dr. Rose
Slides by: Dr. Rose

January 30 & February 1, 2007

Page 2: Bioinformatics Algorithms and Data Structures

Core String Edits

• This chapter introduces inexact matching.
– Inexact matching is used to compute similarity.
– Sequence similarity is a key concept.
– Sequence similarity implies
   • Structural similarity
   • Functional similarity
– We will consider a dynamic programming approach to inexact matching.

Page 3: Bioinformatics Algorithms and Data Structures

Edit Distance

• One measure of similarity between two strings is their edit distance.
• This is a measure of the number of operations required to transform the first string into the second.
• Single-character operations:
– Deletion of a character from the first string
– Insertion of a character into the first string
– Substitution of a character in the first string with a character from the second string
– Match of a character in the first string with a character of the second

Page 4: Bioinformatics Algorithms and Data Structures

Edit Distance

Example from textbook: transform vintner to writers

vintner    replace v with w    wintner
wintner    insert r after w    wrintner
wrintner   match i             wrintner
wrintner   delete n            writner
writner    match t             writner
writner    delete n            writer
writer     match e             writer
writer     match r             writer
writer     insert s            writers

Page 5: Bioinformatics Algorithms and Data Structures

Edit Distance

Let Σ = {I, D, R, M} be the edit alphabet (Insert, Delete, Replace, Match).

Defn. An edit transcript of two strings is a string over Σ describing a transformation of one string into another.

Defn. The edit distance between two strings is defined as the minimum number of edit operations needed to transform the first into the second. Matches are not included in the count.

Edit distance is also called Levenshtein distance.

Page 6: Bioinformatics Algorithms and Data Structures

Edit Distance

Defn. An optimal transcript is an edit transcript with the minimal number of edit operations for transforming one string into another.

Note: optimal transcripts may not be unique.

Defn. The edit distance problem entails computing the edit distance between two strings along with an optimal transcript.
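The definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `apply_transcript` is hypothetical. It assumes a transcript written over the edit alphabet {I, D, R, M} and takes inserted and replacement characters from the second string.

```python
def apply_transcript(s1, s2, transcript):
    """Apply an edit transcript over {I, D, R, M} to s1.

    Inserted and replacement characters are read from s2.
    Returns (resulting string, number of non-match operations).
    """
    out = []
    i = j = 0      # next unread position in s1 and in s2
    edits = 0      # matches are not counted
    for op in transcript:
        if op == "M":                      # match: characters must agree
            if s1[i] != s2[j]:
                raise ValueError("M used on differing characters")
            out.append(s1[i])
            i, j = i + 1, j + 1
        elif op == "R":                    # replace s1[i] with s2[j]
            out.append(s2[j])
            i, j = i + 1, j + 1
            edits += 1
        elif op == "D":                    # delete s1[i]
            i += 1
            edits += 1
        elif op == "I":                    # insert s2[j]
            out.append(s2[j])
            j += 1
            edits += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s1) or j != len(s2):
        raise ValueError("transcript does not consume both strings")
    return "".join(out), edits


# The textbook example: replace, insert, match, delete, match, delete,
# match, match, insert.
print(apply_transcript("vintner", "writers", "RIMDMDMMI"))  # ('writers', 5)
```

A transcript is valid only if it consumes both strings completely, which is why the sketch raises an error otherwise.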

Page 7: Bioinformatics Algorithms and Data Structures

String Alignment

Defn. A global alignment of strings S1 and S2 is obtained by:
1. Inserting dashes/spaces into or at the ends of S1 and S2.
2. Aligning the two strings s.t. each character/space in either string is opposite a unique character/space in the other string.

Example 1: S1 = qacdbd, S2 = qawxb

q a c - d b d
q a w x - b -

Page 8: Bioinformatics Algorithms and Data Structures

String Alignment

Example 2: S1 = vintner, S2 = writers

v - i n t n e r -
w r i - t - e r s

• Mathematically, string alignment and edit transcripts are equivalent.

• From a modeling perspective they are not equivalent.

• Edit transcripts express the idea of mutational changes.
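The equivalence between transcripts and alignments can be illustrated by converting one into the other. This is a small Python sketch under the assumption that transcripts use the alphabet {I, D, R, M}; the name `transcript_to_alignment` is hypothetical and does not come from the slides.

```python
def transcript_to_alignment(s1, s2, transcript):
    """Turn an edit transcript into the two rows of a global alignment."""
    row1, row2 = [], []    # row1 holds S1 with dashes, row2 holds S2 with dashes
    i = j = 0
    for op in transcript:
        if op in ("M", "R"):       # both strings contribute a character
            row1.append(s1[i])
            row2.append(s2[j])
            i, j = i + 1, j + 1
        elif op == "D":            # character of S1 opposite a dash
            row1.append(s1[i])
            row2.append("-")
            i += 1
        elif op == "I":            # dash opposite a character of S2
            row1.append("-")
            row2.append(s2[j])
            j += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return " ".join(row1), " ".join(row2)


top, bottom = transcript_to_alignment("vintner", "writers", "RIMDMDMMI")
print(top)     # v - i n t n e r -
print(bottom)  # w r i - t - e r s
```

The output reproduces Example 2: every D puts a dash in the S2 row and every I puts a dash in the S1 row.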

Page 9: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Observation 1: There are many possible ways to transform one string into another.
• Observation 2: This is like the knapsack problem.
• Recall: dynamic programming is used to solve knapsack-like problems.
• Defn. Let D(i,j) denote the edit distance of S1[1..i] and S2[1..j].
– That is, D(i,j) is the minimum number of edit ops needed to transform the first i characters of S1 into the first j characters of S2.

Page 10: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Notice that we can solve D(i,j) for all combinations of prefix lengths of S1 and S2.
• Examples: D(0,0), ..., D(0,j), D(1,0), ..., D(1,j), ..., D(i,j)
• Dynamic programming resembles divide and conquer in that it builds a solution from solutions to smaller subproblems; unlike divide and conquer, the subproblems overlap, so each is solved once and its result is stored.
• The three parts to dynamic programming are:
– The recurrence relation
– Tabular computation
– Traceback

Page 11: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• The recurrence relation expresses the recursive relation between a problem and smaller instances of the problem.
• For any recursive relation, the base condition(s) must be specified.
• Base conditions for D(i,j) are:
– D(i,0) = i
   Q: Why is this true? What does it mean in terms of edit ops?
– D(0,j) = j
   Q: Why is this true? What does it mean in terms of edit ops?

Page 12: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• The general recurrence is given by:
   D(i,j) = min[ D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i,j) ]
   Here t(i,j) = 1 if S1(i) ≠ S2(j), o/w t(i,j) = 0.
• Proof of correctness on Pages 218-219.
• Basic argument: D(i,j) must be one of:
1. D(i - 1, j) + 1
2. D(i, j - 1) + 1
3. D(i - 1, j - 1) + t(i,j)
There are NO other ways of creating S2[1..j] from S1[1..i].

Page 13: Bioinformatics Algorithms and Data Structures

Dynamic Programming

Q: How do we use the recurrence relation to efficiently compute D(i,j)?
Wrong Answer: simply use recursion.
Q: Why is this the wrong answer?
A: Recursion results in inefficient duplication of computations for subproblems.
Q: How much duplication?
A: Exponential duplication!
Example: Fibonacci numbers
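To make the duplication concrete, here is a hedged Python sketch of the "wrong answer": the recurrence D(i,j) coded as plain recursion, with a call counter added purely for illustration. The name `naive_edit_distance` is hypothetical.

```python
def naive_edit_distance(s1, s2):
    """Plain recursion on the recurrence D(i,j); returns (distance, calls)."""
    calls = 0

    def d(i, j):
        nonlocal calls
        calls += 1
        if i == 0:                 # base condition D(0,j) = j
            return j
        if j == 0:                 # base condition D(i,0) = i
            return i
        t = 0 if s1[i - 1] == s2[j - 1] else 1   # Python strings are 0-indexed
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + t)

    return d(len(s1), len(s2)), calls


distance, calls = naive_edit_distance("vintner", "writers")
print(distance)        # 5
print(calls > 8 * 8)   # True: far more calls than distinct (i, j) pairs
```

Only (7 + 1)(7 + 1) = 64 distinct (i, j) pairs exist for these two strings, so every call beyond that number recomputes a value that was already known.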

Page 14: Bioinformatics Algorithms and Data Structures

Dynamic Programming

Example: Fibonacci numbers
f(n) = f(n - 1) + f(n - 2)
Base conditions: f(0) = 0, f(1) = 1

[Figure: recursion tree for f(6). Its 25 nodes are f6, f5, f4 (twice), f3 (3 times), f2 (5 times), f1 (8 times), and f0 (5 times).]
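The repeated subtrees in the recursion tree can be counted directly. The sketch below is illustrative only; `fib_calls` is a hypothetical helper that records how often each f(k) is evaluated by the plain recursive definition.

```python
from collections import Counter


def fib_calls(n, counts):
    """Recursive Fibonacci that tallies every call in `counts`."""
    counts[n] += 1
    if n < 2:                      # base conditions f(0) = 0, f(1) = 1
        return n
    return fib_calls(n - 1, counts) + fib_calls(n - 2, counts)


counts = Counter()
value = fib_calls(6, counts)
print(value)                   # 8
print(sum(counts.values()))    # 25 calls in total
print(counts[2], counts[1])    # 5 8: f(2) is computed 5 times, f(1) 8 times
```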

Page 15: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Note: In calculating D(n,m), there are only (n + 1)(m + 1) unique combinations of i and j.
• Clearly an exponential number of computations is NOT required.
• Soln: instead of going top-down with recursion, go bottom-up. Compute each combination only once.
– Decide on a data structure to hold intermediate results.
– Start from base conditions. These are the smallest D(i,j) values and are already defined.
– Compute D(i,j) for larger values of i and j.

Page 16: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Example: Fibonacci numbers
• Decide on a data structure: simple array
• Start from base conditions: f(0) = 0, f(1) = 1

   0 1

• Compute f(i) for larger values of i. From bottom up.

   0 1 1 2 3 5

• Each f(i) is computed only once!
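The three steps listed on this slide translate almost line for line into code. A minimal Python sketch follows; the function name `fib_table` is an assumption made for illustration.

```python
def fib_table(n):
    """Bottom-up Fibonacci: returns the array [f(0), f(1), ..., f(n)]."""
    f = [0, 1]                       # base conditions f(0) = 0, f(1) = 1
    for i in range(2, n + 1):        # larger values of i, from the bottom up
        f.append(f[i - 1] + f[i - 2])    # each f(i) is computed exactly once
    return f[: n + 1]                # slice handles n = 0


print(fib_table(5))  # [0, 1, 1, 2, 3, 5]
```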

Page 17: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Q: What kind of data structure should we use for edit distance?
1. Has to be a random access data structure.
2. Has to support the dimensionality of the problem.
• D(i,j) is two-dimensional: S1 and S2.
• We will use a two-dimensional array, i.e., a table.

Page 18: Bioinformatics Algorithms and Data Structures

Dynamic Programming

Example: edit distance from vintner to writers.
Fill in the base condition values.

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1
i  2     2
n  3     3
t  4     4
n  5     5
e  6     6
r  7     7

Page 19: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Q: How do we fill in the other values?
• A: Use the recurrence:
   D(i,j) = min[ D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i,j) ]
   where t(i,j) = 1 if S1(i) ≠ S2(j), o/w t(i,j) = 0.
• We can first compute D(1,1) because we have D(0,0), D(0,1), and D(1,0).
– D(1,1) = min[ 1+1, 1+1, 0+1 ] = 1
• Then we have all the values needed to compute in turn D(1,2), D(1,3), ..., D(1,m).
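The tabular computation described here can be sketched in a few lines of Python. This is an illustration rather than the lecturer's code; the name `edit_distance_table` is hypothetical, and the table is stored as a list of lists indexed D[i][j].

```python
def edit_distance_table(s1, s2):
    """Fill the (n+1) x (m+1) table D row by row, left to right."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                  # base condition D(i,0) = i
    for j in range(m + 1):
        D[0][j] = j                  # base condition D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1   # strings are 0-indexed
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + t)
    return D


D = edit_distance_table("vintner", "writers")
print(D[1])      # [1, 1, 2, 3, 4, 5, 6, 7]
print(D[7][7])   # 5
```

Row 1 of the output agrees with the hand computation of D(1,1) = 1 above, and the bottom-right cell holds the edit distance.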

Page 20: Bioinformatics Algorithms and Data Structures

Dynamic Programming

First compute D(1,1) because we have D(0,0), D(0,1), and D(1,0).
Then compute in turn D(1,2), D(1,3), ..., D(1,m).

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1  1  2  3  4  5  6  7
i  2     2
n  3     3
t  4     4
n  5     5
e  6     6
r  7     7

Page 21: Bioinformatics Algorithms and Data Structures

Dynamic Programming

Fill in subsequent values, row by row, from left to right.

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1  1  2  3  4  5  6  7
i  2     2  2  2  2  3  4  5  6
n  3     3  3  3  3  3  4  5  6
t  4     4  4  4  4  3  4  5  6
n  5     5  5  5  5  4  4  5  6
e  6     6  6  6  6  5  4  5  6
r  7     7  7  6  7  6  5  4  5

Page 22: Bioinformatics Algorithms and Data Structures

Dynamic Programming

Alternatively, first compute D(1,1) from D(0,0), D(0,1), and D(1,0).
Then compute in turn D(2,1), D(3,1), ..., D(n,1).

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1  1  2  3  4  5  6  7
i  2     2  2
n  3     3  3
t  4     4  4
n  5     5  5
e  6     6  6
r  7     7  7

Page 23: Bioinformatics Algorithms and Data Structures

Dynamic Programming

Fill in subsequent values, column by column, from top to bottom.

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1  1  2  3  4  5  6  7
i  2     2  2  2  2  3  4  5  6
n  3     3  3  3  3  3  4  5  6
t  4     4  4  4  4  3  4  5  6
n  5     5  5  5  5  4  4  5  6
e  6     6  6  6  6  5  4  5  6
r  7     7  7  6  7  6  5  4  5
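Row-by-row and column-by-column filling produce the same table because both orders visit the three predecessor cells before cell (i,j). The Python sketch below checks this; `fill_table` and its `column_major` flag are hypothetical names. Cells start as None so that reading an unfilled cell would raise an error instead of silently using a wrong value.

```python
def fill_table(s1, s2, column_major=False):
    """Fill D in row-major or column-major order and return it."""
    n, m = len(s1), len(s2)
    D = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    if column_major:   # column by column, top to bottom
        cells = [(i, j) for j in range(1, m + 1) for i in range(1, n + 1)]
    else:              # row by row, left to right
        cells = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    for i, j in cells:
        t = 0 if s1[i - 1] == s2[j - 1] else 1
        D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


by_rows = fill_table("vintner", "writers")
by_cols = fill_table("vintner", "writers", column_major=True)
print(by_rows == by_cols)  # True
```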

Page 24: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Filling each cell entails a constant number of operations.
– Cell (i,j) depends only on characters S1(i) and S2(j) and cells (i - 1, j - 1), (i, j - 1), and (i - 1, j).
• There are O(nm) cells in the table.
• Consequently, we can compute the edit distance D(n, m) in O(nm) time by computing the table in O(nm).

Page 25: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Having computed the table we know the cost of an optimal edit transcript, i.e., the edit distance D(n,m).
• Q: How do we extract the optimal edit transcript from the table?
• A: One way would be to establish pointers from each cell to the predecessor cell(s) from which its value was derived, i.e.,
– If D(i,j) = D(i - 1, j) + 1, add a pointer from (i,j) to (i - 1, j)
– If D(i,j) = D(i, j - 1) + 1, add a pointer from (i,j) to (i, j - 1)
– If D(i,j) = D(i - 1, j - 1) + t(i,j), add a pointer from (i,j) to (i - 1, j - 1)
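The three pointer rules can be applied to every cell once the table is filled. The following Python sketch is one possible encoding, not the slides' own: `traceback_pointers` is a hypothetical name, and pointers are stored as a dictionary mapping each cell to a list of predecessor cells.

```python
def traceback_pointers(s1, s2):
    """Return (D, ptr) where ptr[(i, j)] lists the predecessors of cell (i, j)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ptr = {}
    for i in range(n + 1):
        for j in range(m + 1):
            preds = []
            if i > 0 and D[i][j] == D[i - 1][j] + 1:
                preds.append((i - 1, j))            # vertical pointer
            if j > 0 and D[i][j] == D[i][j - 1] + 1:
                preds.append((i, j - 1))            # horizontal pointer
            if i > 0 and j > 0:
                t = 0 if s1[i - 1] == s2[j - 1] else 1
                if D[i][j] == D[i - 1][j - 1] + t:
                    preds.append((i - 1, j - 1))    # diagonal pointer
            ptr[(i, j)] = preds
    return D, ptr


D, ptr = traceback_pointers("vintner", "writers")
print(ptr[(7, 7)])  # [(7, 6)]
print(ptr[(3, 3)])  # [(2, 3), (2, 2)]
```

A cell with more than one pointer, such as (3,3) here, is a point where optimal transcripts branch.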

Page 26: Bioinformatics Algorithms and Data Structures


Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2

n 3 3

t 4 4

n 5 5

e 6 6

r 7 7

Page 27: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• We can recover an optimal edit sequence simply by following any path of pointers from (n,m) to (0,0).
• The interpretation of the path links is:
– A horizontal link, (i,j) → (i,j-1), corresponds to an insertion of character S2(j) into S1.
– A vertical link, (i,j) → (i-1,j), corresponds to a deletion of S1(i) from S1.
– A diagonal link, (i,j) → (i-1,j-1), corresponds to a match if S1(i) = S2(j) and a substitution if S1(i) ≠ S2(j).
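The link interpretation above yields a simple traceback loop. The Python sketch below recovers a single optimal transcript; it is an illustration with hypothetical names, and it makes one arbitrary choice that the slides do not prescribe: when several links are valid it prefers the diagonal, then the vertical, then the horizontal link. Substitutions are written R, as in the edit alphabet.

```python
def one_optimal_transcript(s1, s2):
    """Trace back from (n, m) to (0, 0) and return one optimal transcript."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:      # diagonal link
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:    # vertical link: deletion
            ops.append("D")
            i -= 1
        else:                                        # horizontal link: insertion
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))    # links were collected back to front


print(one_optimal_transcript("vintner", "writers"))  # RRRMDMMI
```

With a different tie-breaking order the loop would return one of the other optimal transcripts instead.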

Page 28: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

Page 29: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

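An optimal edit path: (0,0) → (1,1) → (2,2) → (3,3) → (4,4) → (5,4) → (6,5) → (7,6) → (7,7).
What edit transcript does this path correspond to?

S,S,S,M,D,M,M,I (S denotes a substitution, written R in the edit alphabet)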

Page 30: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

Another optimal edit path: (0,0) → (0,1) → (1,2) → (2,3) → (3,3) → (4,4) → (5,4) → (6,5) → (7,6) → (7,7).
What edit transcript does this path correspond to?

I,S,M,D,M,D,M,M,I

Page 31: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

The third possible optimal edit path: (0,0) → (1,1) → (1,2) → (2,3) → (3,3) → (4,4) → (5,4) → (6,5) → (7,6) → (7,7).
What edit transcript does this path correspond to?

S,I,M,D,M,D,M,M,I

Page 32: Bioinformatics Algorithms and Data Structures

Dynamic Programming

• Alternatively we can interpret any path of pointers from (n,m) to (0,0) as an alignment of S1 and S2.
• The interpretation of the path links is:
– A horizontal link, (i,j) → (i,j-1), corresponds to an insertion of a space/dash into S1.
– A vertical link, (i,j) → (i-1,j), corresponds to an insertion of a space/dash into S2.
– A diagonal link, (i,j) → (i-1,j-1), corresponds to a match if S1(i) = S2(j) or a mismatch if S1(i) ≠ S2(j).
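Reading the same links as alignment columns gives an alignment directly from the table. The Python sketch below is illustrative only; `one_optimal_alignment` is a hypothetical name, and ties are broken in favour of the diagonal, then the vertical, then the horizontal link, which is an assumption and not something the slides specify.

```python
def one_optimal_alignment(s1, s2):
    """Trace back through D and return (row for S1, row for S2)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        t = 1
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            t = 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            row1.append(s1[i - 1])       # diagonal: match or mismatch column
            row2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            row1.append(s1[i - 1])       # vertical: dash inserted into S2
            row2.append("-")
            i -= 1
        else:
            row1.append("-")             # horizontal: dash inserted into S1
            row2.append(s2[j - 1])
            j -= 1
    return " ".join(reversed(row1)), " ".join(reversed(row2))


a1, a2 = one_optimal_alignment("vintner", "writers")
print(a1)  # v i n t n e r -
print(a2)  # w r i t - e r s
```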

Page 33: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

Possible optimal path. What alignment does this optimal path correspond to?

w r i t - e r s
v i n t n e r -

Page 34: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

A second possible optimal path. What alignment does this optimal path correspond to?

w r i - t - e r s
v - i n t n e r -

Page 35: Bioinformatics Algorithms and Data Structures

Dynamic Programming

D(i,j) w r i t e r s

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

v 1 1 1 2 3 4 5 6 7

i 2 2 2 2 2 3 4 5 6

n 3 3 3 3 3 3 4 5 6

t 4 4 4 4 4 3 4 5 6

n 5 5 5 5 5 4 4 5 6

e 6 6 6 6 6 5 4 5 6

r 7 7 7 6 7 6 5 4 5

A third possible optimal path. What alignment does this optimal path correspond to?

w r i - t - e r s
- v i n t n e r -

Page 36: Bioinformatics Algorithms and Data Structures

Summary

• Any path of pointers from (n,m) to (0,0) corresponds to an optimal edit sequence and an optimal alignment.
• We can recover all optimal edit sequences and alignments simply by extracting all such paths from (n,m) to (0,0).
• The correspondence between paths and edit sequences is one-to-one.
• The correspondence between paths and alignments is one-to-one.
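As a closing illustration, all optimal transcripts can be enumerated by following every valid link from (n,m) back to (0,0). This Python sketch uses hypothetical names and writes substitutions as R; note that the number of optimal paths can grow very quickly for longer strings, so full enumeration is practical only for small examples like this one.

```python
def all_optimal_transcripts(s1, s2):
    """Return every optimal edit transcript, sorted alphabetically."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    def walk(i, j):
        """Yield the transcripts of all optimal paths from (0,0) to (i,j)."""
        if i == 0 and j == 0:
            yield ""
            return
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:          # diagonal link
                for prefix in walk(i - 1, j - 1):
                    yield prefix + ("M" if t == 0 else "R")
        if i > 0 and D[i][j] == D[i - 1][j] + 1:        # vertical link
            for prefix in walk(i - 1, j):
                yield prefix + "D"
        if j > 0 and D[i][j] == D[i][j - 1] + 1:        # horizontal link
            for prefix in walk(i, j - 1):
                yield prefix + "I"

    return sorted(walk(n, m))


print(all_optimal_transcripts("vintner", "writers"))
# ['IRMDMDMMI', 'RIMDMDMMI', 'RRRMDMMI']
```

The three results are the three optimal edit paths shown on the earlier slides, one transcript per path.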