BIC I, Week 4 lectures

Rhys Price Jones and Anne Haake
Rochester Institute of Technology
[email protected], [email protected]

Overview of the need for Dynamic Programming

• Consider Fibonacci
• The obvious algorithm is elegant, easily derived from the definition, and clearly correct.

    (define fib
      (lambda (n)
        (if (<= n 1)
            1
            (+ (fib (- n 2)) (fib (- n 1))))))

• But it’s hopelessly inefficient
• Why?
• Because it makes repeated recursive calls with the same argument
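To see the repetition concretely, here is a minimal Python sketch (Python is used for illustration only; the course code is Scheme). It mirrors the definition above, with fib(0) = fib(1) = 1, and the `calls` counter is an addition of this sketch that records how often each argument is requested.

```python
from collections import Counter

calls = Counter()  # argument -> number of times fib was called with it

def fib(n):
    """Direct translation of the recursive definition (fib 0 = fib 1 = 1)."""
    calls[n] += 1
    if n <= 1:
        return 1
    return fib(n - 2) + fib(n - 1)

result = fib(20)
total_calls = sum(calls.values())  # grows as fast as fib(n) itself
```

For n = 20 the function is entered tens of thousands of times even though only 21 distinct arguments ever occur.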

The Traditional Solution

• Change the order in which the computations are performed

• Change the logic of the program
  – So that it works “bottom up” instead of “top down”
  – Fill an array with calculated values starting with (fib 0), then (fib 1), then (fib 2), etc.

• You can do it manually, as in fib.ss
• That is dynamic programming!
• The main problem is that it requires thought and programming and hence may introduce error.
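A bottom-up version might look like the following Python sketch. It is not the contents of fib.ss, which is not reproduced in this transcript; the name `fib_bottom_up` is hypothetical.

```python
def fib_bottom_up(n):
    """Fill a table from fib(0) upward; each entry is computed exactly once."""
    table = [1] * (n + 1)              # fib(0) = fib(1) = 1, as in the recursive definition
    for i in range(2, n + 1):
        table[i] = table[i - 2] + table[i - 1]
    return table[n]
```

The loop does n - 1 additions, compared with the exponential number of calls made by the top-down recursion.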

It’s not just Fibonacci

• Many programs “write themselves” from the specification of the problem.

• When that happens, we are extremely pleased

• Sadly, the resulting program is often inefficient

• But dynamic programming is a technique to make it efficient again.

Memo-izing

• Redefine the function calling mechanism so that:
  – We first check to see if we’ve made that calculation before
  – If no, go ahead and compute it but store the result in a hash table
  – If yes, look up the previously computed value in the hash table

• Do it once
• Inefficient code becomes efficient automatically with no re-programming

memolambda.ss

memofib.ss
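The same idea can be sketched in Python with a decorator. This is an illustration of the technique, not a translation of memolambda.ss; `memoize` is a hypothetical name, and Python's built-in dict plays the role of the hash table.

```python
def memoize(f):
    """Wrap f so each distinct argument tuple is computed once and then looked up."""
    cache = {}                          # Python's dict is a hash table

    def wrapper(*args):
        if args not in cache:           # not seen before: compute and store
            cache[args] = f(*args)
        return cache[args]              # seen before (or just stored): look it up

    return wrapper

@memoize
def fib(n):
    # Same recursive definition as before; only the calling mechanism changed.
    return 1 if n <= 1 else fib(n - 2) + fib(n - 1)
```

Because the name `fib` now refers to the wrapped function, the recursive calls inside the body also go through the cache, which is what makes the unchanged definition fast.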

Another Example

• Pascal’s triangle
• Each entry is the sum of its parents
  – $C_{n,k} = C_{n-1,k-1} + C_{n-1,k}$
  – $C_{n,0} = C_{n,n} = 1$

• Leading to program
• Runs really slowly
• Replace lambda by memolambda

badcomb.ss

goodcomb.ss
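As a rough Python analogue of the slow and fast versions (not the actual badcomb.ss or goodcomb.ss), the sketch below takes the edges of the triangle (k = 0 or k = n) as 1 and uses the standard library's `functools.lru_cache` in place of memolambda. The names `comb_slow` and `comb_fast` are hypothetical.

```python
from functools import lru_cache

def comb_slow(n, k):
    """Direct recursion on Pascal's rule; shared entries are recomputed many times."""
    if k == 0 or k == n:
        return 1
    return comb_slow(n - 1, k - 1) + comb_slow(n - 1, k)

@lru_cache(maxsize=None)
def comb_fast(n, k):
    """Same definition, with results cached by argument."""
    if k == 0 or k == n:
        return 1
    return comb_fast(n - 1, k - 1) + comb_fast(n - 1, k)
```

The two bodies are identical; only the caching differs, which is the point of replacing lambda by memolambda.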

Review of Pattern Matching

• Does CGGA appear within the sequence ATCGCGTAACGGAGATAGGCTTA ?

• More generally, where does pattern p (length n) appear within text t (length m)?

• Boyer-Moore or Knuth-Morris-Pratt gives O(m+n) search

• If p is going to change a lot and t stays the same, a suffix tree can be built in O(m); each search is then O(n)

• If p is stable and there are lots of different t, a virtual machine can be built in O(n), and then each search is O(m)
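For comparison with the faster methods listed above, the naive search can be sketched in a few lines of Python. The function name `naive_find` is hypothetical; it tries every start position, so its worst case is O(mn).

```python
def naive_find(p, t):
    """Return every index i where p occurs in t, trying each start position in turn."""
    n, m = len(p), len(t)
    return [i for i in range(m - n + 1) if t[i:i + n] == p]

hits = naive_find("CGGA", "ATCGCGTAACGGAGATAGGCTTA")
```

Indices are 0-based, so a hit at index 9 means the match begins at the tenth character of the text.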

Build a Virtual Machine

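• CGGA

[Figure: state diagram of the machine for CGGA, with the transitions C, G, G, A along the matching path and the labels AGT, ACGT, C, C, CAT, AT, GT on the other edges, drawn above the text ATCGCGTAACGGAGATAGGCTTA.]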

First Step through 17th – 23rd Steps

[Figures: the same machine diagram for CGGA, repeated on one slide per step as it scans the text ATCGCGTAACGGAGATAGGCTTA.]
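One way such a machine could be built and run is sketched below in Python. It uses the Knuth-Morris-Pratt style automaton construction, which may differ in detail from the diagram on the slides. The names `build_machine` and `run_machine` are hypothetical, and the sketch assumes the text contains only characters from the given alphabet.

```python
def build_machine(p, alphabet="ACGT"):
    """Build a transition table: machine[state][char] -> next state.

    State j means "the last j characters read match the first j of p".
    Construction takes time proportional to len(p) * len(alphabet).
    """
    n = len(p)
    machine = [dict.fromkeys(alphabet, 0) for _ in range(n + 1)]
    machine[0][p[0]] = 1
    x = 0                                   # state reached after reading p[1:j]
    for j in range(1, n + 1):
        for c in alphabet:
            machine[j][c] = machine[x][c]   # on a mismatch, behave like state x
        if j < n:
            machine[j][p[j]] = j + 1        # on a match, advance
            x = machine[x][p[j]]
    return machine

def run_machine(machine, t):
    """Feed t through the machine one character per step; return match start indices."""
    accept = len(machine) - 1
    state, hits = 0, []
    for i, ch in enumerate(t):
        state = machine[state][ch]
        if state == accept:
            hits.append(i - accept + 1)
    return hits

machine = build_machine("CGGA")
hits = run_machine(machine, "ATCGCGTAACGGAGATAGGCTTA")
```

Once built, the same table can be reused on any number of texts, and each text is scanned in a single pass without backing up.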

Pattern Matching – Conclusion

• Exact pattern matching is easy
• Often the naive algorithm is good enough
• Fast algorithms are readily available
• Sadly, not much use for biological tasks

Why not?

• What’s the difference?
• Mutation
• Insertion/deletion gaps
• We need an inexact way to compare two (or more) biological sequences

Pattern Matching vs. Sequence Alignment

• In the CS world, we talk of comparing strings, or matching patterns of characters within strings

• For biological applications, we talk of comparing sequences, or aligning sequences of nucleotides (or amino acids) to each other

Evolutionary Relatedness

• Consider ACCGT and CACGT
• How likely is it that they are “related”?
• Possible alignments:

    ACCGT      AC-CGT
    XX|||      -|-|||
    CACGT      -CACGT

• Which is better?

It Depends

    ACCGT      AC-CGT
    XX|||      -|-|||
    CACGT      -CACGT

• Scoring 2 for a match, -2 for a mismatch, and -1 for a gap: 2 versus 6

• Scoring 2 for a match, 0 for a mismatch, and -2 for a gap: 6 versus 4

• And we haven’t even begun to consider experimental evidence that might cause us to rank some mutations better than others!
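The arithmetic behind those totals can be checked with a small Python sketch. The function name `score_alignment` is hypothetical; it assumes the two rows of an alignment are equal-length strings with "-" marking a gap.

```python
def score_alignment(top, bottom, match, mismatch, gap):
    """Sum the column scores of two equal-length aligned strings ('-' marks a gap)."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

ungapped = ("ACCGT", "CACGT")      # three matches, two mismatches
gapped = ("AC-CGT", "-CACGT")      # four matches, two gaps

first_scheme = (score_alignment(*ungapped, 2, -2, -1), score_alignment(*gapped, 2, -2, -1))
second_scheme = (score_alignment(*ungapped, 2, 0, -2), score_alignment(*gapped, 2, 0, -2))
```

Under the first scheme the gapped alignment wins, under the second the ungapped one does, so the ranking depends entirely on the chosen scores.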

Distance measure

• Score 0 for a match
• 1 for a mismatch or gap
• Low score best!

    ACCGT      AC-CGT
    XX|||      -|-|||
    CACGT      -CACGT

• Now it’s 2 versus 2

Global alignment

• For two sequences, ACCACC (across the top) and ACACA (down the side)

     -  A  C  C  A  C  C
  -
  A
  C
  A
  C
  A

• Use the scoring scheme to fill in the table, starting with first row and first column

First entries

• Using the distance measure

     -  A  C  C  A  C  C
  -  0  1  2  3  4  5  6
  A  1
  C  2
  A  3
  C  4
  A  5

• Each nucleotide<->gap costs 1 point

Extending inwards

• Extending the distance measure

     -  A  C  C  A  C  C
  -  0  1  2  3  4  5  6
  A  1  0  1  2  3  4  5
  C  2  1
  A  3  2
  C  4  3
  A  5  4

• Extending from North or West costs 1 point, from NW costs 0 (match) or 1 (mismatch)

• Pick cheapest of the three

More extension

     -  A  C  C  A  C  C
  -  0  1  2  3  4  5  6
  A  1  0  1  2  3  4  5
  C  2  1  0  1  2  3  4
  A  3  2  1  1
  C  4  3  2  1
  A  5  4  3  2

• $m_{i,j} = \min(m_{i,j-1} + g,\; m_{i-1,j} + g,\; m_{i-1,j-1} + c_{ij})$

• where $c_{ij}$ = 0 for a match, 1 for a mismatch, and $g$ is the gap cost (1 for this distance measure)

Getting there...

     -  A  C  C  A  C  C
  -  0  1  2  3  4  5  6
  A  1  0  1  2  3  4  5
  C  2  1  0  1  2  3  4
  A  3  2  1  1  1  2  3
  C  4  3  2  1  2
  A  5  4  3  2  1

• $m_{i,j} = \min(m_{i,j-1} + 1,\; m_{i-1,j} + 1,\; m_{i-1,j-1} + c_{ij})$

• where $c_{ij}$ = 0 for a match, 1 for a mismatch

Almost done...

     -  A  C  C  A  C  C
  -  0  1  2  3  4  5  6
  A  1  0  1  2  3  4  5
  C  2  1  0  1  2  3  4
  A  3  2  1  1  1  2  3
  C  4  3  2  1  2  1  2
  A  5  4  3  2  1  2

• $m_{i,j} = \min(m_{i,j-1} + 1,\; m_{i-1,j} + 1,\; m_{i-1,j-1} + c_{ij})$

• where $c_{ij}$ = 0 for a match, 1 for a mismatch

Finally, we can get a Global alignment

• One of the least-cost routes

     -  A  C  C  A  C  C
  -  0  1  2  3  4  5  6
  A  1  0  1  2  3  4  5
  C  2  1  0  1  2  3  4
  A  3  2  1  1  1  2  3
  C  4  3  2  1  2  1  2
  A  5  4  3  2  1  2  2

• Can you see how this path leads to the alignment?

    ACCACC
    AC-ACA
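The table-filling procedure can be sketched in Python as follows. It computes only the table of distances, not the alignment itself; the function name `distance_table` is hypothetical, and the gap cost defaults to 1 as in the distance measure used here.

```python
def distance_table(top, side, gap=1):
    """Fill the global-alignment distance table, row by row.

    Entry m[i][j] is the cheapest cost of aligning the first i characters
    of `side` with the first j characters of `top`.
    """
    rows, cols = len(side) + 1, len(top) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        m[0][j] = j * gap                  # first row: nucleotides against gaps
    for i in range(rows):
        m[i][0] = i * gap                  # first column: likewise
    for i in range(1, rows):
        for j in range(1, cols):
            c = 0 if side[i - 1] == top[j - 1] else 1
            m[i][j] = min(m[i][j - 1] + gap,      # from the West
                          m[i - 1][j] + gap,      # from the North
                          m[i - 1][j - 1] + c)    # from the North-West
    return m

table = distance_table("ACCACC", "ACACA")
```

The bottom-right entry, `table[-1][-1]`, is the distance between the two complete sequences.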

Global alignment program

• Distance measure
• Runnable program
• Dynamic Programming version

globalig.txt

globalig.ss

globaligm.ss
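The step from a directly recursive program to a dynamic programming version can be illustrated with a top-down Python sketch. It is not a translation of the .ss files listed above; `global_distance` is a hypothetical name, and `functools.lru_cache` stands in for memoization.

```python
from functools import lru_cache

def global_distance(a, b, gap=1):
    """Global alignment distance, written recursively and memoized."""

    @lru_cache(maxsize=None)               # without this line the recursion is exponential
    def d(i, j):
        """Distance between the first i characters of a and the first j of b."""
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        c = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i, j - 1) + gap, d(i - 1, j) + gap, d(i - 1, j - 1) + c)

    return d(len(a), len(b))
```

With the cache in place each (i, j) pair is evaluated once, so the work is proportional to the product of the two sequence lengths.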

Global vs Local Alignment

• Global alignment seeks the best alignment between one complete sequence and the other complete sequence. A global alignment between GATCCACCA and GTAACACA might be

    G-ATCCACCA
    |-|X|-||-|
    GTAAC-AC-A

• A local alignment is the best alignment between subsequences. A local alignment between GATCCACCA and GTAACACA might be

     gATCCACca
      |X|-||
    gtAAC-ACa

• Best local alignment depends on scoring scheme

Local Alignment

• For this demo, we will use a different measure
  – 2 for a match
  – -1 for a mismatch, -2 for a gap
  – Find best match within

    G C T C T G C G A A T A G
    C G T T G A G A T A C T C

The solution

     -  G  C  T  C  T  G  C  G  A  A  T  A  G
  -  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  C  0  0  2  0  2  0  0  2  0  0  0  0  0  0
  G  0  2  0  1  0  1  2  0  4  2  0  0  0  2
  T  0  0  1  2  0  2  0  1  2  3  1  2  0  0
  T  0  0  0  3  1  2  1  0  0  1  2  3  1  0
  G  0  2  0  1  2  0  4  2  2  0  0  1  2  3
  A  0  0  1  0  0  1  2  3  1  4  2  0  3  1
  G  0  2  0  0  0  0  3  1  5  3  3  1  1  5
  A  0  0  1  0  0  0  1  2  3  7  5  3  3  3
  T  0  0  0  3  1  2  0  0  1  5  6  7  5  3
  A  0  0  0  1  2  0  1  0  0  3  7  5  9  7
  C  0  0  2  0  3  1  0  3  1  1  5  6  7  8
  T  0  0  0  4  2  5  3  1  2  0  3  7  5  6
  C  0  0  2  2  6  4  4  5  3  1  1  5  6  4

    G C T C T G C G A A T A G
            | | x | | X | |
      C G T T G A G A - T A C T C

The Program

• Has dynamic programming to make it fast!
• This is basically Smith-Waterman
• Work has been done on different scoring schemes, gap penalties, etc.
• Runs in time O(mn)

localig.ss
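A minimal Python sketch of the table fill for local alignment is given below, assuming the scoring from the demo (2 for a match, -1 for a mismatch, -2 for a gap). It is not the contents of localig.ss, the name `local_table` is hypothetical, and it computes scores only, leaving out the backtracking.

```python
def local_table(top, side, match=2, mismatch=-1, gap=-2):
    """Fill a Smith-Waterman style score table in O(mn) time.

    Entry h[i][j] is the best score of any local alignment ending at
    side[i-1] and top[j-1]; scores are never allowed to drop below 0.
    """
    rows, cols = len(side) + 1, len(top) + 1
    h = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if side[i - 1] == top[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,    # extend diagonally
                          h[i - 1][j] + gap,      # gap in the top sequence
                          h[i][j - 1] + gap)      # gap in the side sequence
    return h

h = local_table("GCTCTGCGAATAG", "CGTTGAGATACTC")
best = max(max(row) for row in h)
```

The largest entry in the table marks where the best local alignment ends; following the choices backwards from that cell until a 0 is reached recovers the alignment.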

Exercises

• that we will attempt in class:
  – amend global alignment program to do the “backtracking” needed for the alignment

• that will be homework
  – amend local alignment program to do the “backtracking” needed for the alignment