Edit Distance – Levenshtein Sequence Alignment – Needleman...

21
1 Edit Distance – Levenshtein Sequence Alignment – Needleman & Wunsch Not in Book CS380 Algorithm Design and Analysis

Transcript of Edit Distance – Levenshtein Sequence Alignment – Needleman...

Page 1: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

1

Edit Distance – Levenshtein Sequence Alignment – Needleman & Wunsch

Not in Book

CS380 Algorithm Design and Analysis

Page 2: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

2

EDIT DISTANCE

http://en.wikipedia.org/wiki/Levenshtein_distance

CS380 Algorithm Design and Analysis

Page 3: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

3

Edit Distance

•  Mutation in DNA is evolutionary.

•  DNA replication errors cause o  Substitutions

o  Insertions

o  Deletions

•  of nucleotides, leading to “edited” DNA texts

CS380 Algorithm Design and Analysis

Page 4: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

4

Edit Distance: Definition

•  Introduced by Vladimir Levenshtein in 1966

•  The Edit Distance between two strings is the minimum number of editing operations needed to transform one string into another

•  Operations are: o  Insertion of a symbol

o  Deletion of a symbol

o  Substitution of one symbol for another CS380 Algorithm Design and Analysis

Page 5: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

5

Example

CS380 Algorithm Design and Analysis

•  How would you transform: o  X: TGCATAT

•  To the string: o  Y: ATCCGAT

Page 6: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

6

Edit Distance

•  How many insertions, deletions, substitutions will transform one string into another?

•  Backtracking will give us the steps used to convert one string to another

CS380 Algorithm Design and Analysis

Page 7: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

7

Recursive Solution

•  Let dij = the minimum edit distance of x1x2x3..xi and y1y2y3..yi

CS380 Algorithm Design and Analysis

Insertion to X

Substitution

Deletion from X

Match

Page 8: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

8

Backtracking

•  No need to keep track of the arrows

•  Just know that: o  Match/Substitution: Diagonal

o  Insertion: Horizontal (Left)

o  Deletion: Vertical (Up)

CS380 Algorithm Design and Analysis

Page 9: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

9

Example

•  X = ATCGTT

•  Y = AGTTAC

CS380 Algorithm Design and Analysis

Page 10: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

10

SEQUENCE ALIGNMENT

Kleinberg, Tardos, Algorithm Design, Pearson Addison Wesley, 2006, p 278

http://www.aw-bc.com/info/kleinberg/

CS380 Algorithm Design and Analysis

Page 11: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

11

Sequence Alignment

•  Edit Distance: o  Gave the minimum number of changes to

convert one string into another

•  Sequence Alignment o  Maximizes the similarity by giving weights to

types of differences

CS380 Algorithm Design and Analysis

Page 12: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

12

Sequence Alignment

•  Needleman-Wunsch

•  Similarity based on gaps and mismatches

•  Generalized form of Levenshtein o  additional parameters:

§  gap penalty, δ §  mismatch cost ( αx,y ; αx,x = 0 )

CS380 Algorithm Design and Analysis

Page 13: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

13

Recurrence

•  Two strings x1...xm and y1...yn

•  In an optimal alignment, M, at least one of the following is true: o  (xm, yn) is in M

o  xm is not matched

o  yn is not matched

CS380 Algorithm Design and Analysis

Page 14: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

14

Recurrence

•  So, for i and j > 0

CS380 Algorithm Design and Analysis

Page 15: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

15

Example

•  Assume that: o  δ = 2

o  α (v, v) = 1

o  α (c, c) = 1

o  α (v, c) = 3

•  What is the cost of aligning the strings: o  mean

o  name CS380 Algorithm Design and Analysis

Page 16: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

16

SPACE-EFFICIENT SEQUENCE ALIGNMENT

CS380 Algorithm Design and Analysis

Page 17: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

17

Sequence Alignment Space Usage

•  O(n2) is pretty low space usage

•  However, for a 10GB genome, you’d need a huge amount of memory

•  Can we use less? o  Hirschberg’s algorithm

o  1975

CS380 Algorithm Design and Analysis

Page 18: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

18

Linear Space for Alignment Scores

•  If you are only interested in the cost of the alignment, you need to only use O(n) space

•  How? o  When filling the entries, we only ever look at the

current and previous cols

o  Only keep those two in memory

CS380 Algorithm Design and Analysis

Page 19: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

19

Space-Efficient-Alignment (X, Y)

CS380 Algorithm Design and Analysis

Page 20: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

20

Actual Alignment

•  How do we recover the actual alignment?

•  Do we need the entire matrix?

CS380 Algorithm Design and Analysis

Page 21: Edit Distance – Levenshtein Sequence Alignment – Needleman ...zeus.cs.pacificu.edu/shereen/cs380sp15/Lectures/16Lecture.pdf · 1 Edit Distance – Levenshtein Sequence Alignment

21

Divide-and-Conquer-Alignment (X,Y)

CS380 Algorithm Design and Analysis