Sequence Alignment

Page 1: Sequence Alignment

Sequence Alignment

Page 2: Sequence Alignment

Evolution

Page 3: Sequence Alignment

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS
• Mutation
• Deletion

REARRANGEMENTS
• Inversion
• Translocation
• Duplication

Page 4: Sequence Alignment

Sequence conservation implies function

Alignment is the key to
• Finding important regions
• Determining function
• Uncovering the evolutionary forces

Page 5: Sequence Alignment

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence.

The same two sequences without gaps:

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Page 6: Sequence Alignment

What is a good alignment?

AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-      6 matches, 3 mismatches, 1 gap
AGCGAAGTTT

AGGCTA-GTT-     7 matches, 1 mismatch, 3 gaps
AG-CGAAGTTT

AGGC-TA-GTT-    7 matches, 0 mismatches, 5 gaps
AG-CG-AAGTTT

Page 7: Sequence Alignment

Scoring Function

• Sequence edits:

                AGGCCTC
  – Mutations   AGGACTC
  – Insertions  AGGGCCTC
  – Deletions   AGG.CTC

Scoring Function:
  Match:    +m
  Mismatch: -s
  Gap:      -d

Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
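The scoring function above can be checked on the three candidate alignments from the previous slide. The following is a minimal Python sketch, not part of the original deck; the function name `alignment_score` and the values m = s = d = 1 are assumptions chosen only for illustration.

```python
def alignment_score(top, bottom, m=1, s=1, d=1):
    """Score two gapped strings of equal length.

    Each column adds +m for a match, -s for a mismatch, -d for a gap.
    The default weights are illustrative assumptions, not values from the slides.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


print(alignment_score("AGGCTAGTT-", "AGCGAAGTTT"))      # 6 - 3 - 1 = 2
print(alignment_score("AGGCTA-GTT-", "AG-CGAAGTTT"))    # 7 - 1 - 3 = 3
print(alignment_score("AGGC-TA-GTT-", "AG-CG-AAGTTT"))  # 7 - 0 - 5 = 2
```

With these weights the second alignment scores highest; other choices of m, s, and d can change the ranking.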

Page 8: Sequence Alignment

How do we compute the best alignment?

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Too many possible alignments:

>> 2^N

Page 9: Sequence Alignment

Alignment is additive

Observation: The score of aligning
  x1……xM
  y1……yN
is additive

Say that
  x1…xi   xi+1…xM
aligns to
  y1…yj   yj+1…yN

The two scores add up:

F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Page 10: Sequence Alignment

Dynamic Programming

• There are only a polynomial number of subproblems
  – Align x1…xi to y1…yj
• Original problem is one of the subproblems
  – Align x1…xM to y1…yN
• Each subproblem is easily solved from smaller subproblems
  – ???
• Then, we can apply Dynamic Programming!!!

Let F(i,j) = optimal score of aligning
  x1……xi
  y1……yj

Page 11: Sequence Alignment

Dynamic Programming (cont’d)

Notice three possible cases:

1. xi aligns to yj
     x1……xi-1 xi
     y1……yj-1 yj

   F(i,j) = F(i-1, j-1) + m, if xi = yj
   F(i,j) = F(i-1, j-1) - s, if not

2. xi aligns to a gap
     x1……xi-1 xi
     y1……yj   -

   F(i,j) = F(i-1, j) - d

3. yj aligns to a gap
     x1……xi   -
     y1……yj-1 yj

   F(i,j) = F(i, j-1) - d

Page 12: Sequence Alignment

Dynamic Programming (cont’d)

How do we know which case is correct?

Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal

Then,

F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                F(i-1, j) - d,
                F(i, j-1) - d }

where s(xi, yj) = m, if xi = yj; -s, if not
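The recurrence translates almost line for line into a table fill. Below is a hedged Python sketch that returns only the optimal score F(M, N). The name `optimal_score` and the default weights m = s = d = 1 are assumptions. The boundary values (a prefix aligned entirely against gaps costs d per letter) are filled in before the main loop.

```python
def optimal_score(x, y, m=1, s=1, d=1):
    """Return F(M, N), the best global alignment score of x and y.

    m, s, d are the match reward, mismatch penalty and gap penalty;
    the defaults are illustrative assumptions.
    """
    M, N = len(x), len(y)
    # F[i][j] = optimal score of aligning x1..xi with y1..yj
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x1..xi against gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y1..yj against gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(
                F[i - 1][j - 1] + sub,  # case 1: xi aligns to yj
                F[i - 1][j] - d,        # case 2: xi aligns to a gap
                F[i][j - 1] - d,        # case 3: yj aligns to a gap
            )
    return F[M][N]


print(optimal_score("AGGCTAGTT", "AGCGAAGTTT"))
```

Because every cell takes the maximum over the three cases, the result can never be lower than the score of any alignment written down by hand for the same weights.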

Page 13: Sequence Alignment

Pairwise Sequence Alignment

• As we’ve seen, sequence similarity is an indicator of homology
• There are other uses for sequence similarity
  – Database queries
  – Comparative genomics
  – …

Page 14: Sequence Alignment

Pairwise Sequence Alignment

• Example

HEAGAWGHEE
PAWHEAE

• Which one is better?

HEAGAWGHE-E
P-A--W-HEAE

HEAGAWGHE-E
--P-AW-HEAE

Page 15: Sequence Alignment

Scoring

• To compare two sequence alignments, calculate a score
  – PAM or BLOSUM matrices
    • Matches and mismatches
  – Gap penalty
    • Initiating a gap
  – Gap extension penalty
    • Extending a gap

Page 16: Sequence Alignment

PAM 250

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

Highlighted entries: s(W, C) = -8 and s(W, W) = 17

Page 17: Sequence Alignment

Example

     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15

• Gap penalty: -8
• Gap extension: -8

HEAGAWGHE-E
--P-AW-HEAE

(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: Calculate the score for

HEAGAWGHE-E
P-A--W-HEAE
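Column-by-column sums like the one above are easy to get wrong by hand, so a small checker can help. The Python sketch below copies the five-row scoring table from this slide and charges -8 for every gap column (gap opening and gap extension are both -8 here). The names `SCORES`, `GAP`, `pair_score`, and `score_alignment` are assumptions for illustration, and the table is assumed to be symmetric so that a pair such as (H, P) can be looked up as (P, H).

```python
# Scoring table from the slide: rows A, E, H, P, W; columns A, E, G, H, W.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}
GAP = -8  # charged once per gap column (opening and extension are both -8)


def pair_score(a, b):
    """Look up s(a, b), falling back to s(b, a) (symmetry is assumed)."""
    row = SCORES.get(a, {})
    return row[b] if b in row else SCORES[b][a]


def score_alignment(top, bottom):
    """Sum the column scores of two gapped strings of equal length."""
    total = 0
    for a, b in zip(top, bottom):
        total += GAP if "-" in (a, b) else pair_score(a, b)
    return total


print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE"))  # 1
```

Calling `score_alignment` on the other alignment from the slide is one way to check an answer to the exercise.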

Page 18: Sequence Alignment

Formal Description

• Problem: PairSeqAlign
• Input:
  – Two sequences x, y
  – Scoring matrix s
  – Gap penalty d
  – Gap extension penalty e
• Output: The optimal sequence alignment

Page 19: Sequence Alignment

How Difficult Is This?

• Consider two sequences of length n
• There are

  C(2n, n) = (2n)! / (n!)^2 ≈ 2^(2n) / sqrt(πn)

  possible global alignments, and we need to find an optimal one from amongst those!

Page 20: Sequence Alignment

So what?

• So at n = 20, we have over 120 billion possible alignments
• We want to be able to align much, much longer sequences
  – Some proteins have 1000 amino acids
  – Genes can have several thousand base pairs
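To see how quickly exhaustive search becomes hopeless, the count can be evaluated directly. This Python sketch assumes the count in question is the central binomial coefficient (2n choose n), which agrees with the "over 120 billion" figure at n = 20; the function names are illustrative only.

```python
import math


def count_alignments(n):
    """Exact value of (2n choose n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n) of the same count."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


print(count_alignments(20))            # 137846528820
print(len(str(count_alignments(1000))))  # number of decimal digits at n = 1000
```

Already at protein length the count has hundreds of digits, which is why the table-based approach on the following slides matters.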

Page 21: Sequence Alignment

Dynamic Programming

• General algorithmic development technique
• Reuses the results of previous computations
  – Store intermediate results in a table for reuse
• Look up in table for earlier result to build from

Page 22: Sequence Alignment

Global Alignment

• Needleman-Wunsch 1970
• Idea: Build up optimal alignment from optimal alignments of subsequences

HEAG
--P-      score -25

Three ways to extend it by one column:

HEAGA
--P-A     score -20   (top and bottom: add score from table)

HEAGA
--P--     score -33   (gap with bottom)

HEAG-
--P-A     score -33   (gap with top)

Page 23: Sequence Alignment

MSCS 230: Bioinformatics I - Pairwise Sequence Alignment

Global Alignment

• Notation
  – xi – ith letter of string x
  – yj – jth letter of string y
  – x1..i – Prefix of x from letters 1 through i
  – F – matrix of optimal scores
    • F(i,j) represents optimal score lining up x1..i with y1..j
  – d – gap penalty
  – s – scoring matrix

Page 24: Sequence Alignment

Global Alignment

• The work is to build up F
• Initialize: F(0,0) = 0, F(i,0) = -id, F(0,j) = -jd
• Fill from top left to bottom right using the recursive relation

F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                F(i-1, j) - d,
                F(i, j-1) - d }

Page 25: Sequence Alignment

Global Alignment

F(i-1,j-1) → F(i,j): add s(xi,yj); move ahead in both
F(i-1,j) → F(i,j): gap penalty d; xi aligned to gap
F(i,j-1) → F(i,j): gap penalty d; yj aligned to gap

While building the table, keep track of where each optimal score came from; following those arrows in reverse from F(M,N) recovers the alignment.
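Filling the table and then walking the arrows backwards can be sketched in Python as follows. This is an illustrative sketch, not the course's reference implementation: the names `SCORES`, `s`, and `needleman_wunsch` are assumptions, the scoring table is the five-row table from the example slides (assumed symmetric), d = 8, and ties are broken by preferring the diagonal move. Instead of storing arrows, the traceback re-derives each move by testing which case produced the cell's value.

```python
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}


def s(a, b):
    """Substitution score, looked up in either orientation."""
    row = SCORES.get(a, {})
    return row[b] if b in row else SCORES[b][a]


def needleman_wunsch(x, y, d=8):
    """Return (optimal score, aligned x, aligned y) for a linear gap penalty d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback from F(M, N) to F(0, 0), preferring the diagonal on ties.
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])   # move ahead in both
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-")        # xi aligned to gap
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])        # yj aligned to gap
            j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)   # 1
print(top)     # HEAGAWGHE-E
print(bottom)  # --P-AW-HEAE
```

Several cells on the path have ties, so a different tie-breaking rule can return a different alignment with the same optimal score.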

Page 26: Sequence Alignment

Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56

Scoring table:

     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15

Page 27: Sequence Alignment

Completed Table

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Page 28: Sequence Alignment

The Needleman-Wunsch Matrix

[Figure: matrix with x1 … xM along the top and y1 … yN down the side]

Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences

An optimal alignment is composed of optimal subalignments

Page 29: Sequence Alignment

Performance

• Time: O(NM)
• Space: O(NM)

Page 30: Sequence Alignment

Design a Perl Program

• Is it doable in Perl?
• How can we handle a two-dimensional array?

Page 31: Sequence Alignment

[Figure: a two-dimensional array drawn as three rows of cells numbered 0 to 9, above a one-dimensional array with cells numbered 0 to 29]

Page 32: Sequence Alignment

[Figure: the two-dimensional array M (three rows of cells numbered 0 to 9, with i running across and j running down) mapped onto the one-dimensional array S (cells numbered 0 to 29)]

M[i, j] = S[j*len+i]
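The index formula M[i, j] = S[j*len+i] stores the table row by row in one flat array, where len is the number of cells per row. The sketch below illustrates the idea in Python rather than Perl; the class name `FlatMatrix` and its methods are assumptions made for this example.

```python
class FlatMatrix:
    """A width x height table stored in a single flat list.

    Cell (i, j), with i counting across and j counting down,
    lives at position j * width + i of the list.
    """

    def __init__(self, width, height, fill=0):
        self.width = width
        self.height = height
        self.cells = [fill] * (width * height)

    def get(self, i, j):
        return self.cells[j * self.width + i]

    def set(self, i, j, value):
        self.cells[j * self.width + i] = value


# Three rows of ten cells, as in the figure: flat positions 0 to 29.
table = FlatMatrix(10, 3)
table.set(4, 2, -8)
print(table.cells.index(-8))  # 2 * 10 + 4 = 24
print(table.get(4, 2))        # -8
```

The same arithmetic works in any language that offers only one-dimensional arrays; the score table F would use width N + 1 or M + 1 depending on which sequence runs across.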

Page 33: Sequence Alignment

Detail

[Figure: the score table addressed by numeric index. Columns are numbered 0 to 10; the first column holds 0, -8 (P), -16 (A), -24 (W), -32 (H), -40 (E), -48 (A), -56 (E); the last row (E) is shown as -56 -38 -24 -11 -6 -12 -14 -15 -12 -9]

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1