Post on 03-Feb-2022
MPI for Developmental Biology, Tubingen
Pairwise sequence alignment
Christoph Dieterich
Department of Evolutionary Biology, Max Planck Institute for Developmental Biology
Pairwise alignment Substitution matrices Gap penalties Global Alignment algorithm Local Alignment algorithm
Example
Alignment between the very similar human alpha- and beta-globins:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV  A+++++AH+D++ +++++LS+LH  KL
GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Example
Plausible alignment to leghaemoglobin from yellow lupin:
GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV   + +A ++           +L+ L+++H+ K
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
Example
A spurious high-scoring alignment of human alpha globin to a nematode glutathione S-transferase homologue:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
GS+ + G +   +D L  ++ H+ D+  A +AL D    ++AH+
GSGYLVGDSLTFVDLLVAQHTADLL--AANAALLDEFPQFKAHQE
Assessing the quality of an alignment
The goal is to use similarity-based alignments to uncover homology, while avoiding homoplasy.
Homoplasy: random mutations that appear in parallel or convergently in two different lineages.
The scoring model
The computation of an alignment critically depends on the choice of parameters. Generally, no existing scoring model can be applied to all situations.
When evolutionary relationships between the sequences are reconstructed, scoring matrices based on mutation rates are usually applied.
When protein domains are compared, the scoring matrices should be based on the composition of the domains and their substitution frequencies.
Substitution matrices
To be able to score an alignment, we need to determine score terms for each aligned residue pair.
Definition
A substitution matrix $S$ over an alphabet $\Sigma = \{a_1, \dots, a_\kappa\}$ has $\kappa \times \kappa$ entries, where each entry $(i, j)$ assigns a score for a substitution of the letter $a_i$ by the letter $a_j$ in an alignment.
Basic idea: Follow the scheme of statistical hypothesis testing.
$$\mathrm{Score}\begin{pmatrix} a \\ b \end{pmatrix} = \frac{f(a, b)}{f(a) \cdot f(b)}$$
The frequencies of the letters $f(a)$ as well as the substitution frequencies $f(a, b)$ stem from a representative data set.
Null hypothesis / Random model
Given a pair of aligned sequences (without gaps), the null hypothesis states that the two sequences are unrelated (not homologous). The alignment is then random with a probability described by the model $R$. The unrelated or random model $R$ assumes that in each aligned pair of residues the two residues occur independently of each other. Then the probability of the two sequences is:
$$P(X, Y \mid R) = P(X \mid R)\,P(Y \mid R) = \prod_i p_{x_i} \prod_i p_{y_i}.$$
Match model
In the match model $M$, describing the alternative hypothesis, aligned pairs of residues occur with a joint probability $p_{ab}$, which is the probability that $a$ and $b$ have each evolved from some unknown original residue $c$ as their common ancestor. Thus, the probability for the whole alignment is:
$$P(X, Y \mid M) = \prod_i p_{x_i y_i}.$$
Odds ratio
The ratio of the two gives a measure of the relative likelihood that the sequences are related (model $M$) as opposed to being unrelated (model $R$). This ratio is called the odds ratio:
$$\frac{P(X, Y \mid M)}{P(X, Y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}$$
Log-odds ratio
To obtain an additive scoring scheme, we take the logarithm (base 2 is usually chosen) to get the log-odds ratio:
$$S = \log\left(\frac{P(X, Y \mid M)}{P(X, Y \mid R)}\right) = \log\left(\prod_i \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}\right) = \sum_i s(x_i, y_i),$$
with
$$s(a, b) := \log\left(\frac{p_{ab}}{p_a\, p_b}\right).$$
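The additive scheme above can be sketched in a few lines of Python. This is only an illustration: the two-letter alphabet and the probabilities in `p_single` and `p_joint` are made-up values, and the function name `log_odds_score` is hypothetical.

```python
import math

def log_odds_score(x, y, p_joint, p_single, base=2.0):
    """Sum the per-column scores s(a, b) = log(p_ab / (p_a * p_b))."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    total = 0.0
    for a, b in zip(x, y):
        total += math.log(p_joint[(a, b)] / (p_single[a] * p_single[b]), base)
    return total

# Hypothetical model parameters, chosen only for illustration.
p_single = {"A": 0.5, "C": 0.5}
p_joint = {("A", "A"): 0.4, ("C", "C"): 0.4, ("A", "C"): 0.1, ("C", "A"): 0.1}

score = log_odds_score("AACA", "AACC", p_joint, p_single)
print(round(score, 3))
```

Because the logarithm turns the product of per-column odds into a sum, each aligned pair can be scored independently and the contributions added up.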
PAM matrices
Definition (PAM)
One point accepted mutation (1 PAM) is defined as an expected number of substitutions per site of 0.01. A 1 PAM substitution matrix is thus derived from any evolutionary model by setting the row sum of off-diagonal terms to 0.01 and adjusting the diagonal terms to keep the row sum equal to 1.
Definition (Jukes-Cantor Model)
The basic assumption is equality of substitution frequency for any nucleotide at any site. Thus, changing a nucleotide to each of the three remaining nucleotides has probability $\alpha$ per time unit. The rate of nucleotide substitution per site per time unit is then $r = 3\alpha$.
PAM matrices
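Let's build a PAM 1 matrix under a Jukes-Cantor model of sequence evolution.
$$\begin{pmatrix} 1 - 3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1 - 3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1 - 3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1 - 3\alpha \end{pmatrix}$$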
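We scale the matrix entries such that the expected number of substitutions per site is $3\alpha = 0.01$ and obtain a probability matrix (off-diagonal entries rounded to 0.003):
$$\begin{pmatrix} 0.99 & 0.003 & 0.003 & 0.003 \\ 0.003 & 0.99 & 0.003 & 0.003 \\ 0.003 & 0.003 & 0.99 & 0.003 \\ 0.003 & 0.003 & 0.003 & 0.99 \end{pmatrix}$$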
A scoring matrix is then obtained by computing the log-odds ratios:
$$s(a, b) := \log\left(\frac{p_{ab}}{p_a\, p_b}\right),$$
with $p_A = p_C = p_G = p_T = 0.25$ and joint probabilities as given by the PAM probability matrix.
This leads to the following substitution score matrix:
$$\begin{pmatrix} 398 & -438 & -438 & -438 \\ -438 & 398 & -438 & -438 \\ -438 & -438 & 398 & -438 \\ -438 & -438 & -438 & 398 \end{pmatrix}$$
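The printed scores can be reproduced with a short Python sketch. Two points are assumptions inferred from the numbers rather than stated in the slides: the log-odds are taken in base 2 and multiplied by 100, and the result is truncated to an integer. The function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def jukes_cantor_pam1(subs_per_site=0.01, digits=3):
    """PAM 1 probability matrix under Jukes-Cantor, with rounded entries."""
    alpha = round(subs_per_site / 3, digits)   # 0.003 after rounding
    diagonal = round(1 - subs_per_site, digits)  # 0.99
    return {(a, b): (diagonal if a == b else alpha)
            for a in NUCLEOTIDES for b in NUCLEOTIDES}

def score_matrix(prob, background=0.25, scale=100):
    """Log-odds scores; scale factor and truncation are assumptions."""
    return {pair: int(scale * math.log2(p / (background * background)))
            for pair, p in prob.items()}

prob = jukes_cantor_pam1()
scores = score_matrix(prob)
print(scores[("A", "A")], scores[("A", "C")])
```

Under these assumptions the diagonal entries come out as 398 and the off-diagonal entries as -438, matching the matrix above.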
BLOCKS and BLOSUM matrices
The BLOSUM matrices were derived from the database BLOCKS [1]. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
[1] Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 89(22):10915-9. BLOCKS database server: http://blocks.fhcrc.org/
For the scoring matrices of the BLOSUM (= BLOcks SUbstitution Matrix) family, all blocks of the database are evaluated columnwise. For each possible pair of amino acids, the frequency $f(a_i, a_j)$ of common pairs $(a_i, a_j)$ in all columns is determined.
Altogether there are $\binom{n}{2}$ possible pairs that we can draw from this alignment. We now assume that the observed frequencies are equal to the frequencies in the population. Then
$$p_{aa} = \text{observed} / \binom{n}{2}.$$
The observed frequency of a single amino acid is generally computed as $p_a = p_{aa} + \sum_{b \neq a} p_{ab}/2$. For this example we then get $p_A = 0.8 + 0.2/2 = 0.9$ and $p_C = 0.1$.
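The counting step can be illustrated with a small Python sketch. The example alignment itself is not reproduced in this text, so the column used here (nine A and one C) is a hypothetical input that happens to give the same numbers; the function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

def pair_frequencies(column):
    """Relative frequency of each unordered residue pair in one column."""
    n = len(column)
    total_pairs = n * (n - 1) // 2  # n choose 2
    counts = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    return {pair: count / total_pairs for pair, count in counts.items()}

def residue_frequencies(p_pair):
    """p_a = p_aa + sum over b != a of p_ab / 2."""
    freqs = Counter()
    for (a, b), p in p_pair.items():
        if a == b:
            freqs[a] += p
        else:
            freqs[a] += p / 2
            freqs[b] += p / 2
    return dict(freqs)

column = "AAAAAAAAAC"  # hypothetical column: nine A, one C
p_pair = pair_frequencies(column)
p_res = residue_frequencies(p_pair)
print(p_pair)
print(p_res)
```

For this column there are 45 pairs, of which 36 are (A, A) and 9 are (A, C), so the pair frequencies are 0.8 and 0.2 and the residue frequencies are 0.9 for A and 0.1 for C.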
Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to sum to one.
BLOSUM62 is scaled so that its values are in half-bits, i.e. the log-odds were multiplied by $2/\log_2 2$ and then rounded to the nearest integer value.
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
...
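As a small illustration of how such a matrix is used, the following Python sketch scores an ungapped alignment with the A, R, N, D sub-block of the table above. The helper names `parse_matrix` and `ungapped_score` and the two test sequences are hypothetical.

```python
# Upper-left 4 x 4 corner of the BLOSUM62 excerpt shown above.
SUB_BLOCK = """
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6
"""

def parse_matrix(text):
    """Turn a whitespace-separated table into a {(row, column): score} dict."""
    lines = [line.split() for line in text.strip().splitlines()]
    header = lines[0]
    return {(row[0], col): int(value)
            for row in lines[1:]
            for col, value in zip(header, row[1:])}

def ungapped_score(x, y, matrix):
    """Sum the substitution scores of the aligned residue pairs."""
    return sum(matrix[(a, b)] for a, b in zip(x, y))

blosum_corner = parse_matrix(SUB_BLOCK)
print(ungapped_score("ARND", "ARDN", blosum_corner))
```

The two identical pairs contribute 4 and 5, and the two N/D substitutions contribute 1 each.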
Gap penalties
Gaps are undesirable and thus penalized. The standard cost associated with a gap of length $g$ is given either by a linear score
$$\gamma(g) = -g d$$
or an affine score
$$\gamma(g) = -d - (g - 1) e,$$
where $d$ is the gap open penalty and $e$ is the gap extension penalty.
Usually, $e < d$, with the result that fewer isolated gaps are produced, as shown in the following comparison:
Linear gap penalty:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
GSAQVKGHGKK--------VA--D----A-SALSDLHAHKL
Affine gap penalty:
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
GSAQVKGHGKKVADA---------------SALSDLHAHKL
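To see why an affine score favours one long gap, the gap part of the two alignments above can be compared in Python. The penalty values `d = 12` and `e = 2` are hypothetical, and only the gap terms are computed here, not the substitution scores.

```python
import re

def linear_gap(g, d):
    """Linear score: gamma(g) = -g * d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine score: gamma(g) = -d - (g - 1) * e."""
    return -d - (g - 1) * e

def gap_lengths(aligned_row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]

def total_gap_score(aligned_row, gap_fn):
    return sum(gap_fn(g) for g in gap_lengths(aligned_row))

scattered = "GSAQVKGHGKK--------VA--D----A-SALSDLHAHKL"
single = "GSAQVKGHGKKVADA---------------SALSDLHAHKL"
d, e = 12, 2  # hypothetical gap open and gap extension penalties

for name, row in [("scattered", scattered), ("single", single)]:
    lin = total_gap_score(row, lambda g: linear_gap(g, d))
    aff = total_gap_score(row, lambda g: affine_gap(g, d, e))
    print(name, gap_lengths(row), lin, aff)
```

Both rows contain 15 gap positions, so the linear score cannot tell them apart. Under the affine score the four separate gaps pay the open penalty four times, whereas the single gap pays it once.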
Alignment algorithms
Given a scoring scheme, we need an algorithm that computes the highest-scoring alignment of two sequences.
As for the edit distance-based alignments, we will discuss alignment algorithms based on dynamic programming. They are guaranteed to find the optimal scoring alignment.
Note of caution: optimal pairwise alignment algorithms are of complexity $O(n \cdot m)$.
Global alignment: Needleman-Wunsch algorithm
Problem
Consider the problem of obtaining the best global alignment of two sequences. The Needleman-Wunsch algorithm is a dynamic program that solves this problem.
Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller substrings.
$$F : \{0, 1, \dots, n\} \times \{0, 1, \dots, m\} \to \mathbb{R}$$
in which $F(i, j)$ equals the best score of the alignment of the two prefixes $(x_1, x_2, \dots, x_i)$ and $(y_1, y_2, \dots, y_j)$.
Matrix layout (columns $0, x_1, \dots, x_n$; rows $0, y_1, \dots, y_m$; $F(0, 0)$ in the top-left corner):
               x_{i-1}            x_i
  y_{j-1}    F(i-1, j-1)       F(i, j-1)
                         ↘         ↓
  y_j        F(i-1, j)     →    F(i, j)
We obtain $F(i, j)$ as the largest score arising from these three options:
$$F(i, j) := \max \begin{cases} F(i - 1, j - 1) + s(x_i, y_j) \\ F(i - 1, j) - d \\ F(i, j - 1) - d. \end{cases}$$
This is applied repeatedly until the whole matrix $F(i, j)$ is filled with values.
Recursion
To complete the description of the recursion, we need to set the values of $F(i, 0)$ and $F(0, j)$ for $i \neq 0$ and $j \neq 0$:
We set $F(i, 0) = -i \cdot d$ for $i = 0, 1, \dots, n$ and
we set $F(0, j) = -j \cdot d$ for $j = 0, 1, \dots, m$.
The final value $F(n, m)$ contains the score of the best global alignment between $X$ and $Y$.
Example of a global alignment matrix
D     0    G    A    T    T    A    G
0     0   -2   -4   -6   -8  -10  -12
A    -2   -1   -1   -3   -5   -7   -9
T    -4   -3   -2    0   -2   -4   -6
T    -6   -5   -4   -1    1   -1   -3
A    -8   -7   -4   -3   -1    2    0
C   -10   -9   -6   -5   -3    0    1
Pseudo code of Needleman-Wunsch
Input: two sequences X and Y
Output: optimal alignment and score α
Initialization:
  Set F(i, 0) := −i · d and T(i, 0) := (i − 1, 0) for all i = 0, 1, 2, . . . , n
  Set F(0, j) := −j · d and T(0, j) := (0, j − 1) for all j = 0, 1, 2, . . . , m
For i = 1, 2, . . . , n do:
  For j = 1, 2, . . . , m do:
    Set F(i, j) := max { F(i − 1, j − 1) + s(x_i, y_j),
                         F(i − 1, j) − d,
                         F(i, j − 1) − d }
    Set backtrace T(i, j) to the maximizing pair (i′, j′)
The score is α := F(n, m)
Set (i, j) := (n, m)
repeat
  if T(i, j) = (i − 1, j − 1) print the column (x_i over y_j)
  else if T(i, j) = (i − 1, j) print the column (x_i over −)
  else print the column (− over y_j)
  Set (i, j) := T(i, j)
until (i, j) = (0, 0).
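The pseudocode translates fairly directly into Python. The following is a minimal sketch with a linear gap penalty `d`; the function names and the unit scoring function (match +1, mismatch −1) are assumptions made for the example, and the backtrace pointers on the border are set explicitly so that the traceback can always reach (0, 0).

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment of x and y; returns (score, row for x, row for y, F)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: x against gaps only
        F[i][0], T[i][0] = -i * d, (i - 1, 0)
    for j in range(1, m + 1):            # first row: y against gaps only
        F[0][j], T[0][j] = -j * d, (0, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), (i - 1, j - 1)),
                (F[i - 1][j] - d, (i - 1, j)),
                (F[i][j - 1] - d, (i, j - 1)),
            ]
            F[i][j], T[i][j] = max(options, key=lambda option: option[0])
    # Traceback from (n, m) to (0, 0); the columns are collected in reverse.
    top, bottom = [], []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = T[i][j]
        top.append(x[i - 1] if pi < i else "-")
        bottom.append(y[j - 1] if pj < j else "-")
        i, j = pi, pj
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom)), F

def unit_score(a, b):
    return 1 if a == b else -1

alpha, top, bottom, F = needleman_wunsch("GATTAG", "ATTAC", unit_score, d=2)
print(alpha)
print(top)
print(bottom)
```

When several options tie, `max` keeps the first one in the list, so this sketch prefers the diagonal move; other tie-breaking rules give different but equally optimal alignments.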
Complexity of Needleman-Wunsch
We need to store $(n + 1) \times (m + 1)$ numbers. Each number takes a constant number of calculations to compute: three sums and a max.
Hence, for filling the matrix, the algorithm requires $O(nm)$ time and memory. Given the filled matrix, the construction of the alignment is done in time $O(n + m)$.
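If only the score is needed and not the alignment itself, the memory can be reduced, because each row of the matrix depends only on the previous one. The sketch below is a common variation and not part of the algorithm as described above; the function name and the unit scoring function are hypothetical.

```python
def nw_score_two_rows(x, y, score, d):
    """Global alignment score using O(len(y)) memory instead of O(n * m)."""
    prev = [-j * d for j in range(len(y) + 1)]   # row for the empty prefix of x
    for i in range(1, len(x) + 1):
        curr = [-i * d]                          # first column of the new row
        for j in range(1, len(y) + 1):
            curr.append(max(prev[j - 1] + score(x[i - 1], y[j - 1]),
                            prev[j] - d,
                            curr[j - 1] - d))
        prev = curr
    return prev[-1]

def unit_score(a, b):
    return 1 if a == b else -1

print(nw_score_two_rows("GATTAG", "ATTAC", unit_score, d=2))
```

The running time is still proportional to n · m; only the memory changes, and the traceback is no longer available without further work.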
Local alignment: Smith-Waterman algorithm
Global alignment is applicable when we have two similar sequences that we want to align from end to end, e.g. two homologous genes from related species.
Problem
Global alignment is inapplicable to modular sequences.
Here we would like to find the best match between substrings of two sequences.
TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC
AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC
Here the score of an alignment between two substrings would be larger than the score of an alignment between the full-length strings.
Definition
Let $X = x_1 \dots x_n$ and $Y = y_1 \dots y_m$ be two sequences over an alphabet $\Sigma$. Let $\delta$ be a score function for an alignment. A local alignment of $X$ and $Y$ is a global alignment of substrings $X' = x_{i_1} \dots x_{i_2}$ and $Y' = y_{j_1} \dots y_{j_2}$. An alignment $A = (X', Y')$ of substrings $X'$ and $Y'$ is an optimal local alignment of $X$ and $Y$ with respect to $\delta$ if
$$\delta(A) = \max_{A'} \{\, \delta(X', Y') \mid X' \text{ is a substring of } X,\ Y' \text{ is a substring of } Y \,\}$$
Example
Let X = AAAAACTCTCTCT and Y = GCGCGCGCAAAAA. Let s(a, a) = +1, s(a, b) = −1 and s(a, −) = s(−, a) = −2 be a scoring function. Then an optimal local alignment
          AAAAA(CTCTCTCT)
          |||||
(GCGCGCGC)AAAAA
in this case has a score of 5, whereas the optimal global alignment
AAAAACTCTCTCT
     | |
GCGCGCGCAAAAA
has a score of −9 (2 matches and 11 mismatches).
Local alignment: Smith-Waterman algorithm
The Smith-Waterman local alignment algorithm (Smith, T. and Waterman, M. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197, 1981) is a modification of the global alignment algorithm.
Modification in main recursion
In the main recursion, we set the value of $F(i, j)$ to zero if all attainable values at position $(i, j)$ are negative:
$$F(i, j) = \max \begin{cases} 0, \\ F(i - 1, j - 1) + s(x_i, y_j), \\ F(i - 1, j) - d, \\ F(i, j - 1) - d. \end{cases}$$
The value $F(i, j) = 0$ indicates that we should start a new alignment at $(i, j)$. This is because, if the best alignment up to $(i, j)$ has a negative score, then it is better to start a new one rather than to extend the old one.
Base conditions
For local alignments we need to set $F(i, 0) = 0$ and $F(0, j) = 0$ for all $i = 0, 1, 2, \dots, n$ and $j = 0, 1, 2, \dots, m$.
Modification in traceback
Instead of starting the traceback at $(n, m)$, we start it at the cell with the highest score: $\arg\max F(i, j)$. The traceback ends upon arrival at a cell with score 0, which corresponds to the start of the alignment.
Traceback via recursion
Input: Similarity matrix M of two strings s = s_1 . . . s_m and t = t_1 . . . t_n
Output: Optimal local alignment (s′, t′) of s and t
Procedure Align(i, j):
  if M(i, j) = 0 then
    s′ := ε
    t′ := ε
  else if M(i, j) = M(i − 1, j) + g then
    (u, v) := Align(i − 1, j)
    s′ := concat(u, s_i)
    t′ := concat(v, '−')
  else if M(i, j) = M(i, j − 1) + g then
    (u, v) := Align(i, j − 1)
    s′ := concat(u, '−')
    t′ := concat(v, t_j)
  else
    (u, v) := Align(i − 1, j − 1)
    s′ := concat(u, s_i)
    t′ := concat(v, t_j)
  return (s′, t′)
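The modified recursion, the zero border and the traceback from the best cell can be put together in a short Python sketch. It uses a linear gap penalty `d` and an iterative traceback instead of the recursive procedure; unlike the procedure above, it tests the diagonal move first. The function names and the unit scoring function are assumptions made for the example.

```python
def smith_waterman(x, y, score, d):
    """Best local alignment of x and y; returns (score, row for x, row for y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]    # border cells stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    # Trace back from the highest-scoring cell until a cell with score 0.
    top, bottom = [], []
    i, j = best_cell
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

def unit_score(a, b):
    return 1 if a == b else -1

print(smith_waterman("AAAAACTCTCTCT", "GCGCGCGCAAAAA", unit_score, d=2))
```

For the two sequences of the earlier example this finds the five aligned A residues with a score of 5.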
Local alignment: Smith-Waterman algorithm
For this algorithm to work, we require that the expected score for a random match is negative, i.e. that
$$\sum_{a, b \in \Sigma} p_a \cdot p_b \cdot s(a, b) < 0,$$
where $p_a$ and $p_b$ are the probabilities for seeing the symbol $a$ or $b$, respectively, at any given position. Otherwise, matrix entries will tend to be positive, producing long matches between random sequences.
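The condition is easy to check numerically for a given scoring scheme. The sketch below assumes uniform nucleotide frequencies and two hypothetical scoring functions; `expected_score` is an illustrative name.

```python
from itertools import product

def expected_score(freqs, score):
    """Sum over all symbol pairs of p_a * p_b * s(a, b)."""
    return sum(freqs[a] * freqs[b] * score(a, b)
               for a, b in product(freqs, repeat=2))

uniform = {nucleotide: 0.25 for nucleotide in "ACGT"}

def unit_score(a, b):          # match +1, mismatch -1
    return 1 if a == b else -1

def lenient_score(a, b):       # hypothetical: match +3, mismatch -1
    return 3 if a == b else -1

print(expected_score(uniform, unit_score))     # -0.5
print(expected_score(uniform, lenient_score))  # 0.0
```

With match +1 and mismatch −1 the expected score is 4 · (1/16) − 12 · (1/16) = −0.5, so the requirement holds. With match +3 it is exactly 0, so that scheme would not satisfy the strict inequality.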
Local vs. Global Alignment
The Global Alignment Problem tries to find the optimal path between vertices $(0, 0)$ and $(n, m)$ in the matrix graph.
The Local Alignment Problem tries to find the optimal path among paths between arbitrary vertices $(i, j)$ and $(i', j')$ in the matrix graph such that $i < i'$ and $j < j'$.
Example
Smith-Waterman matrix of the sequences GATTAG and ATTAC with s(a, a) = 1, s(a, b) = −1 and s(a, −) = s(−, a) = −2:
F     0    G    A    T    T    A    G
0
A
T
T
A
C
Score:
Alignment =