Sequence Alignment
Evolution
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
Mutation
SEQUENCE EDITS
REARRANGEMENTS
Deletion
Inversion
Translocation
Duplication
Sequence conservation implies function
Alignment is the key to
• Finding important regions
• Determining function
• Uncovering the evolutionary forces
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings x = x1x2...xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each
letter in one sequence with either a letter, or a gap in the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
What is a good alignment?
AGGCTAGTT, AGCGAAGTTT
AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap
AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps
AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
Scoring Function
• Sequence edits: AGGCCTC
– Mutations AGGACTC
– Insertions AGGGCCTC
– Deletions AGG.CTC
Scoring Function:
Match: +m
Mismatch: -s
Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
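The scoring function above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and the default values m = s = d = 1 are assumptions, not values given on the slide.

```python
def alignment_score(top, bottom, m=1, s=1, d=1):
    """Score two gapped strings of equal length, column by column.

    m, s and d are the match reward, mismatch penalty and gap penalty.
    The defaults of 1 are illustrative assumptions only.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= d  # a letter lined up with a gap
        elif a == b:
            score += m  # match
        else:
            score -= s  # mismatch
    return score


# The three candidate alignments of AGGCTAGTT and AGCGAAGTTT.
candidates = [
    ("AGGCTAGTT-", "AGCGAAGTTT"),
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),
]
for top, bottom in candidates:
    print(top, bottom, alignment_score(top, bottom))
```

With these unit weights the second alignment scores highest (7 - 1 - 3 = 3); other choices of m, s and d can change the ranking.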
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: >> 2^N
Alignment is additive
Observation: The score of aligning
x1……xM
y1……yN
is additive
Say that
x1…xi xi+1…xM
aligns to
y1…yj yj+1…yN
The two scores add up:
F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
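Additivity can be checked numerically on a small case. The sketch below assumes unit weights (m = s = d = 1) and uses hypothetical helper names; it cuts one alignment of AGGCTAGTT and AGCGAAGTTT at every column and compares the whole score with the sum of the two parts.

```python
def column_score(a, b, m=1, s=1, d=1):
    """Score of a single alignment column (unit weights are an assumption)."""
    if a == "-" or b == "-":
        return -d
    return m if a == b else -s


def score(top, bottom):
    """Total score: the sum of the independent column scores."""
    return sum(column_score(a, b) for a, b in zip(top, bottom))


top, bottom = "AGGCTA-GTT-", "AG-CGAAGTTT"
whole = score(top, bottom)

# Cut the alignment at every column boundary and add the two partial scores.
for cut in range(len(top) + 1):
    left = score(top[:cut], bottom[:cut])
    right = score(top[cut:], bottom[cut:])
    assert left + right == whole

print("whole alignment:", whole)
```

Because each column is scored independently of its neighbours, the check passes at every cut point.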
Dynamic Programming
• There are only a polynomial number of subproblems
– Align x1…xi to y1…yj
• Original problem is one of the subproblems
– Align x1…xM to y1…yN
• Each subproblem is easily solved from smaller subproblems
– ???
• Then, we can apply Dynamic Programming!!!
Let F(i,j) = optimal score of aligning
x1……xi
y1……yj
Dynamic Programming (cont’d)
Notice three possible cases:
1. xi aligns to yj
x1……xi-1 xi
y1……yj-1 yj
F(i,j) = F(i-1, j-1) + m, if xi = yj
F(i,j) = F(i-1, j-1) - s, if not
2. xi aligns to a gap
x1……xi-1 xi
y1……yj -
F(i,j) = F(i-1, j) - d
3. yj aligns to a gap
x1……xi -
y1……yj-1 yj
F(i,j) = F(i, j-1) - d
Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
Then,
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]
where s(xi, yj) = m, if xi = yj; -s, if not
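The recurrence translates almost directly into a table-filling loop. The following is a minimal sketch, not a reference implementation: the name `optimal_score`, the unit weights, and the boundary values F(i,0) = -i·d and F(0,j) = -j·d (a prefix aligned entirely against gaps) are assumptions made for the example.

```python
def optimal_score(x, y, m=1, s=1, d=1):
    """Fill F(i, j) with the three-case recurrence and return F(M, N).

    Assumed boundary: aligning a prefix against nothing costs one gap
    penalty d per letter, so F(i, 0) = -i*d and F(0, j) = -j*d.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(
                F[i - 1][j - 1] + sub,  # xi aligns to yj
                F[i - 1][j] - d,        # xi aligns to a gap
                F[i][j - 1] - d,        # yj aligns to a gap
            )
    return F[M][N]


print(optimal_score("AGGCTAGTT", "AGCGAAGTTT"))
```

The value printed is at least as large as the score of any hand-made alignment of the same two strings under the same weights.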
Pairwise Sequence Alignment
• As we’ve seen, sequence similarity is an indicator of homology
• There are other uses for sequence similarity
– Database queries
– Comparative genomics
– …
Pairwise Sequence Alignment
• Example
• Which one is better?
HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
P-A--W-HEAE
HEAGAWGHE-E
--P-AW-HEAE
Scoring
• To compare two sequence alignments, calculate a score
– PAM or BLOSUM matrices
  • Matches and mismatches
– Gap penalty
  • Initiating a gap
– Gap extension penalty
  • Extending a gap
PAM 250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
Highlighted entries: C with W = -8, W with W = 17
Example
     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15
• Gap penalty: -8
• Gap extension: -8
HEAGAWGHE-E
P-A--W-HEAE

HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
Exercise: Calculate the score for the other alignment (HEAGAWGHE-E over P-A--W-HEAE)
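A short script can check this kind of column-by-column sum. The sketch below copies the example scoring matrix from the slide and charges 8 for every gap column (the slide gives -8 for both opening and extending a gap). The names `TABLE`, `s` and `score_alignment` are hypothetical, and the lookup assumes the matrix is symmetric, since the slide lists only some rows and columns.

```python
# Example scoring matrix from the slide: rows A, E, H, P, W; columns A, E, G, H, W.
ROWS, COLS = "AEHPW", "AEGHW"
VALUES = [
    [5, -1, 0, -2, -3],
    [-1, 6, -3, 0, -3],
    [-2, 0, -2, 10, -3],
    [-1, -1, -2, -2, -4],
    [-3, -3, -3, -3, 15],
]
TABLE = {(r, c): v for r, row in zip(ROWS, VALUES) for c, v in zip(COLS, row)}


def s(a, b):
    """Substitution score; falls back to the mirrored entry (symmetry assumed)."""
    return TABLE[(a, b)] if (a, b) in TABLE else TABLE[(b, a)]


def score_alignment(top, bottom, d=8):
    """Sum substitution scores over letter columns and -d over gap columns."""
    total = 0
    for a, b in zip(top, bottom):
        total += -d if "-" in (a, b) else s(a, b)
    return total


# Eleven columns, five of them gap columns.
print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE"))
```

Summing all eleven columns of this alignment, gap columns included, gives 1. The same function can be used to check the exercise.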
Formal Description
• Problem: PairSeqAlign
• Input: Two sequences x, y; scoring matrix s; gap penalty d; gap extension penalty e
• Output: The optimal sequence alignment
How Difficult Is This?
• Consider two sequences of length n
• There are
\[ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}} \]
possible global alignments, and we need to find an optimal one from amongst those!
So what?
• So at n = 20, we have over 120 billion possible alignments
• We want to be able to align much, much longer sequences
– Some proteins have 1000 amino acids
– Genes can have several thousand base pairs
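To get a feel for these numbers, the count can be evaluated directly. The sketch assumes the count on the slide is the central binomial coefficient, (2n choose n), and compares it with the estimate 2^(2n) / sqrt(pi n); the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def count_alignments(n):
    """Assumed count of global alignments of two length-n sequences: C(2n, n)."""
    return math.comb(2 * n, n)


def estimate(n):
    """Stirling-style estimate 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 20):
    print(n, count_alignments(n), round(estimate(n)))

# For sequences of protein length the count is astronomically large.
print(len(str(count_alignments(1000))), "decimal digits for n = 1000")
```

For n = 20 the exact count is above 120 billion, consistent with the figure quoted above, which is why enumerating alignments is not an option for realistic sequence lengths.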
Dynamic Programming
• General algorithmic development technique
• Reuses the results of previous computations
– Store intermediate results in a table for reuse
– Look up in table for earlier result to build from
Global Alignment
• Needleman-Wunsch 1970
• Idea: Build up optimal alignment from optimal alignments of subsequences
HEAG
--P-
-25
Add score from table:
Top and bottom     Gap with bottom    Gap with top
HEAGA              HEAGA              HEAG-
--P-A              --P--              --P-A
-20                -33                -33
MSCS 230: Bioinformatics I - Pairwise Sequence Alignment
Global Alignment
• Notation
– xi – ith letter of string x
– yj – jth letter of string y
– x1..i – Prefix of x from letters 1 through i
– F – matrix of optimal scores
  • F(i,j) represents optimal score lining up x1..i with y1..j
– d – gap penalty
– s – scoring matrix
Global Alignment
• The work is to build up F
• Initialize: F(0,0) = 0, F(i,0) = -id, F(0,j) = -jd
• Fill from top left to bottom right using the recursive relation
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]
Global Alignment
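[Figure: the cell F(i,j) and its three neighbours F(i-1,j-1), F(i-1,j), F(i,j-1)]
F(i-1,j-1) to F(i,j): add s(xi,yj) (move ahead in both)
F(i-1,j) to F(i,j): subtract d (xi aligned to gap)
F(i,j-1) to F(i,j): subtract d (yj aligned to gap)
While building the table, keep track of where each optimal score came from; to recover the alignment, follow these arrows in reverse from the last cell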
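Table filling plus traceback can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: it uses the small example scoring matrix from the earlier slide, a linear gap penalty d = 8, hypothetical names (`TABLE`, `s`, `needleman_wunsch`), and it breaks ties in favour of the diagonal move, which is an arbitrary choice.

```python
# Example scoring matrix: rows A, E, H, P, W; columns A, E, G, H, W.
ROWS, COLS = "AEHPW", "AEGHW"
VALUES = [
    [5, -1, 0, -2, -3],
    [-1, 6, -3, 0, -3],
    [-2, 0, -2, 10, -3],
    [-1, -1, -2, -2, -4],
    [-3, -3, -3, -3, 15],
]
TABLE = {(r, c): v for r, row in zip(ROWS, VALUES) for c, v in zip(COLS, row)}


def s(a, b):
    """Substitution score; the mirrored entry is used if needed (symmetry assumed)."""
    return TABLE[(a, b)] if (a, b) in TABLE else TABLE[(b, a)]


def needleman_wunsch(x, y, d=8):
    """Return (optimal score, gapped x, gapped y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = -i * d, "x_gap"  # xi aligned to a gap
    for j in range(1, m + 1):
        F[0][j], back[0][j] = -j * d, "y_gap"  # yj aligned to a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "both"),
                (F[i - 1][j] - d, "x_gap"),
                (F[i][j - 1] - d, "y_gap"),
            ]
            # max() keeps the first best option, so ties prefer "both".
            F[i][j], back[i][j] = max(options, key=lambda o: o[0])

    # Traceback: follow the recorded moves from the last cell to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "both":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "x_gap":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(top)
print(bottom)
```

For HEAGAWGHEE and PAWHEAE the optimal score is 1. When several alignments share the optimal score, which one the traceback returns depends on the tie-breaking rule.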
Example
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56
Scoring matrix:
     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15
Completed Table
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
The Needleman-Wunsch Matrix
[Figure: matrix with x1 … xM along the top and y1 … yN down the side]
Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences
An optimal alignment is composed of optimal subalignments
Performance
• Time: O(NM)
• Space: O(NM)
Design a Perl Program
• Is it doable in Perl?
• How can we handle a two-dimensional array?
[Figure: a two-dimensional matrix M (i runs across, j runs down) stored row by row in a one-dimensional array S with indices 0 to 29]
M[i, j] = S[j*len+i], where len is the length of one row
[Figure: the alignment table for HEAGAWGHEE and PAWHEAE stored in the same way]
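The mapping M[i, j] = S[j*len+i] can be illustrated independently of Perl. The sketch below uses Python only to keep the examples in one language; the class name `FlatMatrix` and its methods are hypothetical, and `width` plays the role of len in the formula.

```python
class FlatMatrix:
    """A width-by-height matrix stored row by row in one flat list."""

    def __init__(self, width, height, fill=0):
        self.width = width    # row length, "len" in M[i, j] = S[j*len + i]
        self.height = height
        self.cells = [fill] * (width * height)

    def offset(self, i, j):
        """Position of column i, row j in the flat list."""
        if not (0 <= i < self.width and 0 <= j < self.height):
            raise IndexError("cell (%d, %d) is outside the matrix" % (i, j))
        return j * self.width + i

    def get(self, i, j):
        return self.cells[self.offset(i, j)]

    def set(self, i, j, value):
        self.cells[self.offset(i, j)] = value


# Three rows of ten cells, matching flat indices 0 to 29.
M = FlatMatrix(width=10, height=3)
M.set(4, 2, 99)
print(M.offset(4, 2))  # 2 * 10 + 4

# First column of an alignment table with gap penalty 8: 0, -8, -16, ...
F = FlatMatrix(width=11, height=8)
for j in range(F.height):
    F.set(0, j, -8 * j)
print([F.get(0, j) for j in range(F.height)])
```

In Python a list of lists would usually be simpler; the flat layout is shown only because it matches the indexing formula on the slide.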
Detail
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1