Chapter 2
Pairwise Alignment
Pairwise Alignment
• Ask if two sequences are related
• First align the sequences (or parts of them) and then decide whether that alignment is more likely to have occurred because the sequences are related, or just by chance
Sequence Alignment
• Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences
– Pair-wise alignment: compare two sequences
– Multiple sequence alignment: compare more than two sequences
Example sequence alignment
• Task: align “abcdef” with “abdgf”
• Write second sequence below the first
abcdef
abdgf
• Move sequences to give maximum match between them
• Show characters that match using the identical letter
Example sequence alignment
abcdef
ab
abdgf
• Insert gap between b and d on lower sequence to allow d and f to align
Example sequence alignment
abcdef
ab d f
ab-dgf
• Note e and g don’t match
Matching Similarity vs. Identity
• Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
• More on how to define similarity later
Global vs. Local Alignment
• We distinguish
– Global alignment algorithms, which optimize the overall alignment between two sequences
– Local alignment algorithms, which seek only relatively conserved pieces of sequence
• Alignment stops at the ends of regions of strong similarity
• Favors finding conserved patterns in otherwise different pairs of sequences
Global vs. Local Alignment
• Global
LGPSSKQTGKGS-SRIWDN
L    K  GKG   R  D
LN-ITKSAGKGAIMRLGDA
• Local
--------GKG--------
        GKG
--------GKG--------
Why do sequence alignments?
• To find whether two (or more) genes or proteins are evolutionarily related to each other
• To find structurally or functionally similar regions within proteins
Key Issues
• What sorts of alignment should be considered
• The scoring system used to rank alignments
• The algorithm used to find optimal (or good) scoring alignments
• The statistical methods used to evaluate the significance of an alignment score
Example
• The following figure shows an example of three pairwise alignments, all to the same region of the human alpha globin protein sequence (SWISS-PROT database identifier HBA_HUMAN).
• Identical positions are marked with letters, and ‘similar’ positions with a plus (+) sign
Example
• In the first alignment, there are many “matches”; many others are functionally conservative (D-E towards the end)
• The second alignment shows a biologically meaningful alignment (evolutionarily related, the same 3D structure, and same function in oxygen binding); many fewer identities
• The third alignment has a similar number of identities or conservative changes, but it is a spurious alignment to a protein that has a completely different structure and function
Challenges
• How to distinguish the second one from the third one?
• The determination of the scoring system is crucial
• It is difficult to distinguish true alignments from spurious alignments
The Scoring Model
• When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection
• Basic mutational processes
– Substitutions: change residues in a sequence
– Insertions and deletions: add or remove residues
• Insertions and deletions are referred to as “gaps”
• The total score assigned to an alignment will be a sum of terms for each aligned pair of residues, plus terms for each gap
The Scoring Model
• We expect identities and conservative substitutions to be more likely in alignments than we expect by chance, and so to contribute positive score terms
• Non-conservative changes are expected to be observed less frequently in real alignments than we expect by chance, and so these contribute negative score terms
Assumption
• We can consider mutations at different sites in a sequence to have occurred independently
• This is a reasonable approximation for DNA and protein sequences, even though interactions between residues play a very critical role
• Long range dependencies should be considered for structural RNAs
Substitution Matrices
• Consider a pair of sequences, x and y, of lengths n and m
• Let xi be the ith symbol in x and yj be the jth symbol in y
• These symbols come from some alphabet A; in the case of DNA this will be the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids
• We will only consider ungapped global pairwise alignments, i.e., two completely aligned equal-length sequences
Rationale
• Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related as opposed to being unrelated
• Assign a probability to the alignment in each of the two cases
• We consider the ratio of the two probabilities
Unrelated or Random Model
• Let R be the unrelated model
• The letter a occurs independently with some frequency qa, and hence the probability of the two sequences is just the product of the probabilities of each amino acid:
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$
Alternative Match Model
• Let M be the alternative match model
• Aligned pairs of residues occur with a joint probability pab
• A probability for the whole alignment is
$$P(x, y \mid M) = \prod_i p_{x_i y_i}$$
The Odds Ratio
The ratio of these two likelihoods is known as the odds ratio:
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
The Log Odds Ratio
We take the logarithm of the odds ratio:
$$S = \sum_i s(x_i, y_i)$$
where
$$s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$
is the log likelihood ratio of the residue pair (a, b) occurring as an aligned pair, as opposed to an unaligned pair
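The log-odds score can be illustrated with a small sketch. The DNA alphabet, the uniform background frequencies q and the joint probabilities p below are made-up toy values chosen only so that p sums to 1; they are assumptions, not values from these slides.

```python
import math

ALPHABET = "ACGT"
# Assumed toy model: uniform background frequencies q_a.
q = {a: 0.25 for a in ALPHABET}
# Assumed toy joint probabilities p_ab: identical pairs are favoured.
p = {(a, b): 0.19 if a == b else 0.02 for a in ALPHABET for b in ALPHABET}
assert abs(sum(p.values()) - 1.0) < 1e-12  # p must be a probability distribution

def s(a, b):
    """Log-odds score s(a, b) = log(p_ab / (q_a * q_b))."""
    return math.log(p[a, b] / (q[a] * q[b]))

def alignment_score(x, y):
    """Score S of an ungapped alignment of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(s(a, b) for a, b in zip(x, y))

print(s("A", "A"), s("A", "C"))          # positive for a match, negative for a mismatch
print(alignment_score("ACGT", "ACGA"))   # three matches plus one mismatch
```

Because the score is a sum of per-position terms, the independence assumption made earlier is what allows the product of odds ratios to be turned into this additive score.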
Substitution Matrices
• The s(a,b) scores can be arranged in a matrix
• For proteins, they form a 20×20 matrix (score matrix or substitution matrix)
• Using the BLOSUM50 matrix, the first alignment gets a score of 130
• PAM matrices
• Any substitution matrix is making a statement about the probability of observing ab pairs in real alignments
A R N D C Q E G H I L K M F P S T W Y V
A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0
R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3
N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3
D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4
C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3
E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3
G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4
H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4
I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4
L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1
K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3
M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1
F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3
S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3
Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1
V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5
BLOSUM50
A C D E F G H I K L M N P Q R S T V W Y
A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 Ala
C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 Cys
D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 Asp
E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 Glu
F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 Phe
G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 Gly
H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 His
I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 Ile
K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 Lys
L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 Leu
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 Met
N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 Asn
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 Pro
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 Gln
R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 Arg
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 Ser
T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 Thr
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 Val
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 Trp
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 Tyr
Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
BLOSUM62
Henikoff, S. and Henikoff, J.G. (1992) Proc. Nat. Acad. Sci. USA 89, 10915-10919.
Gap Penalties
The standard cost associated with a gap of length g is given either by a linear score
$$\gamma(g) = -gd$$
or an affine score
$$\gamma(g) = -d - (g-1)e$$
where d is called the gap-open penalty and e is called the gap-extension penalty.
Gap Penalties
• The gap-extension penalty e is usually set to something less than the gap-open penalty d, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost
• This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue
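A short sketch of the two gap cost functions makes the difference concrete. The function names are illustrative, and the parameter values in the example calls (d=8 for the linear cost, d=12 and e=2 for the affine cost) are simply values that appear elsewhere in this chapter, used here as assumptions.

```python
def linear_gap(g, d):
    """Linear gap score: gamma(g) = -g * d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap score: gamma(g) = -d - (g - 1) * e."""
    return -d - (g - 1) * e

for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these values a single-residue gap costs more under the affine scheme (-12 versus -8), but a gap of length 5 costs much less (-20 versus -40), which is the behaviour the bullets above describe.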
Gap Probability
The probability of a gap occurring at a particular site in a given sequence is
$$P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$$
where f(g) depends only on the gap length g and the product runs over the residues in the gap. The qa probabilities are the same as those used in the random model. When we divide by the probability of this region according to the random model to form the odds ratio, the qxi terms cancel out, so we are left only with a term dependent on length γ(g)=log(f(g)); i.e., gap penalties correspond to the log probability of a gap of that length.
Alignment Algorithms
• Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences
• When both sequences have the same length n and no gaps are allowed, there is only one possible global alignment of the complete sequences
• When gaps are allowed, there are
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}}$$
possible global alignments between two sequences of length n
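To get a feel for how quickly the number of gapped global alignments grows, the sketch below compares the exact count (2n)!/(n!)^2 with the approximation 2^(2n)/sqrt(pi n). The function names are illustrative only.

```python
import math

def alignment_count(n):
    """Exact count (2n)! / (n!)^2 for two sequences of length n."""
    return math.comb(2 * n, n)

def alignment_count_approx(n):
    """Stirling-style approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (2, 10, 100):
    exact = alignment_count(n)
    print(n, exact, exact / alignment_count_approx(n))
```

Even for n=100 the count is far beyond anything that could be enumerated, which is why the chapter turns to dynamic programming next.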
Example
(1) ab    (2) ab-   (3) ab-   (4) -ab
    cd        -cd       c-d       cd-
(5) ab--  (6) -ab-
    --cd      c--d
Dynamic Programming
• Guaranteed to find the optimal scoring alignment or set of alignments
• Central to computational sequence analysis
• Maximize the score to find the optimal alignment
Example
• We wish to align two short amino acid sequences: HEAGAWGHEE & PAWHEAE
• We use the BLOSUM50 score matrix, and a gap cost per unaligned residue of d=8 (each gapped position contributes -8 to the score)
H E A G A W G H E E
P -2 -1 -1 -2 -1 -4 -2 -2 -1 -1
A -2 -1 5 0 5 -3 0 -2 -1 -1
W -3 -3 -3 -3 -3 15 -3 -3 -3 -3
H 10 0 -2 -2 -2 -3 -2 10 0 0
E 0 6 -1 -3 -1 -3 -3 0 6 6
A -2 -1 5 0 5 -3 0 -2 -1 -1
E 0 6 -1 -3 -1 -3 -3 0 6 6
Global Alignment: Needleman-Wunsch Algorithm
• Construct a matrix F indexed by i, j, one index from each sequence
• F(i, j) is the score of the best alignment between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj
• We begin with F(0,0)=0 and then fill the matrix from top left to bottom right
• If F(i-1, j-1), F(i-1, j), and F(i, j-1) are known, it is possible to calculate F(i, j)
Three Ways of Alignments
• xi is aligned to yj
I G A xi
L G V yj
• xi is aligned to a gap
A I G A xi
G V yj - -
• yj is aligned to a gap
G A xi - -
S L G V yj
Three Ways of Alignments
• xi is aligned to yj, F(i, j)= F(i-1, j-1)+s(xi, yj)
• xi is aligned to a gap, F(i, j)= F(i-1, j)-d
• yj is aligned to a gap, F(i, j)= F(i, j-1)-d
Three Ways of Alignments
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases}$$
The Diagram
F(i-1,j-1) F(i,j-1)
s(xi,yj) -d
F(i-1,j) F(i,j)
-d
The F Matrix
• As we fill in the F(i, j), we also keep a pointer in each cell back to the cell from which its F(i, j) was derived
• Along the top row, where j=0, the values F(i, j-1) and F(i-1, j-1) are not defined, so the values F(i, 0) must be handled specially
• The values F(i, 0) represent alignments of a prefix of x to all gaps in y, so we can define F(i, 0) =-id. Likewise, F(0, j)=-jd
• F(n, m) is the best score for an alignment of x1…n to y1…m
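The global alignment recurrence and traceback can be sketched as follows, using the HEAGAWGHEE / PAWHEAE example and the BLOSUM50 scores from the example table, with a gap cost of 8 per unaligned residue. The names `ROWS`, `S` and `needleman_wunsch` are illustrative, and the tie-breaking order in the traceback (diagonal first) is an assumption; other orders can return a different alignment with the same score.

```python
X, Y = "HEAGAWGHEE", "PAWHEAE"
# BLOSUM50 scores copied from the example table: one row per distinct
# residue of Y, one column per residue of X (in order).
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}
S = {(a, b): v for b, row in ROWS.items() for a, v in zip(X, row)}

def needleman_wunsch(x, y, s, d):
    """Return (score, aligned_x, aligned_y) for a linear gap cost d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # prefix of x aligned to gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d          # prefix of y aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s[x[i - 1], y[j - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback from (n, m) to (0, 0), recomputing which option gave F(i, j).
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s[x[i - 1], y[j - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = needleman_wunsch(X, Y, S, 8)
print(score)
print(ax)
print(ay)
```

This version recomputes the winning option during traceback instead of storing explicit pointers, which keeps the sketch short at the cost of a little extra work.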
The Global Dynamic Programming Matrix
Local Alignment: Smith-Waterman Algorithm
• In the previous section, we knew which sequences we wanted to align, and we were looking for the best match between them from one end to the other
• Most often, we are looking for the best alignment between subsequences of x and y
• This arises when it is suspected that two protein sequences may share a common domain, or when comparing extended sections of genomic DNA sequence
Local Alignment
• It is usually the most sensitive way to detect similarity when comparing two very highly diverged sequences, even if they share an evolutionary origin along their entire length
• In this case, only part of the sequence has been under strong enough selection to preserve detectable similarity; the rest will have accumulated so much noise through mutation that it is no longer alignable
• The highest scoring alignment of subsequences of x and y is called the best local alignment
The Algorithm
• The algorithm is closely related to that for global alignments. However, there are two differences.
• In each cell in the table, F(i, j) is allowed to take 0 if all other options have values less than 0
• An alignment can end anywhere in the matrix
The Algorithm
$$F(i, j) = \max \begin{cases} 0, \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases}$$
The First Difference
• Taking the option 0 corresponds to starting a new alignment
• If the best alignment up to some point has a negative score, it is better to start a new one, rather than extend the old one
• The top row and left column will be filled with 0s, not –id and –jd as for global alignment
The Second Difference
• We look for the highest value of F(i, j) over the whole matrix, and start the traceback from there
• The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment
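Both differences from the global algorithm show up directly in code. The sketch below reuses the HEAGAWGHEE / PAWHEAE example with the BLOSUM50 scores from the example table and a gap cost of 8; the names `ROWS`, `S` and `smith_waterman` are illustrative, and returning only the first cell that attains the maximum is an assumption made for brevity.

```python
X, Y = "HEAGAWGHEE", "PAWHEAE"
# BLOSUM50 scores copied from the example table (rows: residues of Y,
# columns: residues of X in order).
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}
S = {(a, b): v for b, row in ROWS.items() for a, v in zip(X, row)}

def smith_waterman(x, y, s, d):
    """Return (best_score, aligned_x, aligned_y) of one best local alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # top row and left column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                   # start a new alignment
                          F[i - 1][j - 1] + s[x[i - 1], y[j - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Traceback from the highest-scoring cell until a cell with value 0.
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s[x[i - 1], y[j - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

best, ax, ay = smith_waterman(X, Y, S, 8)
print(best)   # 28
print(ax)     # AWGHE
print(ay)     # AW-HE
```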
Example
• Note that the local alignment is a subset of the global alignment, but that is not always the case
Global Alignment
HEAGAWGHE-E
--P-AW-HEAE
Local Alignment
AWGHE
AW-HE
Local Alignment
• When considering local alignment, the expected score for a random match must be negative
• If this is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length
• Some s(a, b) must be greater than 0, otherwise the algorithm won’t find any alignment at all
Expected Score
• Assume that there is no gap and successive positions are independent
• The expected score of a fixed length alignment can be evaluated by
$$\sum_{a,b} q_a q_b \, s(a, b),$$
where qa is the probability of a symbol a at any given position in a sequence.
Expected Score
When s(a, b) is derived as a log likelihood ratio, using the same qa as for the random model probabilities, we will have
$$\sum_{a,b} q_a q_b \, s(a, b) = -\sum_{a,b} q_a q_b \log\frac{q_a q_b}{p_{ab}} < 0$$
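The negativity condition is easy to check numerically for a hand-made scoring scheme. The sketch below assumes a DNA alphabet with uniform frequencies and two illustrative match/mismatch schemes; neither scheme comes from these slides.

```python
from itertools import product

def expected_score(q, s):
    """Expected per-position score sum_{a,b} q_a * q_b * s(a, b)."""
    return sum(q[a] * q[b] * s(a, b) for a, b in product(q, repeat=2))

q = {base: 0.25 for base in "ACGT"}   # assumed uniform background frequencies

# Match +1 / mismatch -1: the expected score is negative, so the scheme is usable.
plus1_minus1 = expected_score(q, lambda a, b: 1 if a == b else -1)
# Match +3 / mismatch -1: the expected score is zero, so random matches are not penalized.
plus3_minus1 = expected_score(q, lambda a, b: 3 if a == b else -1)

print(plus1_minus1, plus3_minus1)   # -0.5 0.0
```

The second scheme shows how an apparently reasonable choice of scores can violate the requirement, in which case long random alignments would no longer be discouraged.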
Repeated Matches
• If one or both of the sequences are long, there may exist many different local alignments with a significant score, and in most cases we would be interested in all of these
• For example, there may be many copies of a repeated domain or motif in a protein
• We want to find one or more non-overlapping copies of sections of one sequence in the other
Repeated Matches
• We are interested in matches scoring higher than T
• Let y be the sequence containing the domain or motif, and x be the sequence in which we are looking for multiple matches
The Algorithm
• The meaning and the recurrence of F(i, j) are different
• In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched
• F(i, j) for j ≥ 1 is the best sum of match scores to x1…i, assuming that xi is in a matched region, and the corresponding match ends in xi and yj
• F(i, 0) is the best sum of completed match scores to the subsequence x1…i , i.e. assuming that xi is in an unmatched region
The Algorithm
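$$F(0, 0) = 0,$$
$$F(i, 0) = \max \begin{cases} F(i-1, 0), \\ F(i-1, j) - T, & j = 1, \dots, m; \end{cases}$$
$$F(i, j) = \max \begin{cases} F(i, 0), \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases}$$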
The Algorithm
• F(i, 0) handles unmatched regions and ends of matches, only allowing matches to end when they have a score of at least T
• F(i, j) handles starts of matches and extensions
• The total score of all matches is derived by adding an extra cell to the matrix, F(n+1, 0)
• This score will have T subtracted for each match; if there were no matches of score greater than T, it will be 0
The Algorithm
• The individual match alignments can be obtained by tracking back from cell (n+1,0) to (0,0), at each point going back to the cell that was the source of the score in the current max() operation
• This traceback procedure is a global procedure, showing what each residue in x will be aligned to
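A score-only sketch of the repeated-match recurrence is given below (the traceback is omitted for brevity). It uses the HEAGAWGHEE / PAWHEAE example with the BLOSUM50 scores from the example table; the gap cost d=8 and threshold T=20 are assumed values for illustration, and the name `repeated_match_score` is hypothetical. Cells F(0, j) are taken to be 0, which is what the recurrence gives when its undefined terms are dropped.

```python
X, Y = "HEAGAWGHEE", "PAWHEAE"
# BLOSUM50 scores copied from the example table (rows: residues of Y,
# columns: residues of X in order).
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}
S = {(a, b): v for b, row in ROWS.items() for a, v in zip(X, row)}

def repeated_match_score(x, y, s, d, T):
    """Return F(n+1, 0): the summed excess over T of all matches of y in x."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 2)]   # row 0 (F(0, j)) stays 0
    for i in range(1, n + 2):
        # Unmatched state: stay unmatched, or end a match that scored at least T.
        F[i][0] = max(F[i - 1][0], max(F[i - 1][j] - T for j in range(1, m + 1)))
        if i == n + 1:
            break                                # extra cell only; no residue x_{n+1}
        for j in range(1, m + 1):
            F[i][j] = max(F[i][0],               # start a new match here
                          F[i - 1][j - 1] + s[x[i - 1], y[j - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n + 1][0]

total = repeated_match_score(X, Y, S, d=8, T=20)
print(total)   # 9
```

With these assumed parameters the total of 9 is the sum of two excesses: a match scoring 21 contributes 1, and a later match scoring 28 contributes 8.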
Example
Remark
• The algorithm obtains all the local matches in one pass
• It finds the maximal scoring set of matches, in the sense of maximizing the combined total of excess of each match score above the threshold T
• Changing the value of T changes what the algorithm finds
• Increasing T may exclude matches
Remark
• Decreasing it may split them, as well as finding new weaker ones
• A locally optimal match in the sense of the preceding section will be split into pieces if it contains an internal subalignment scoring less than -T
Overlap Matches
• We may expect that one sequence contains the other, or that they overlap
• It occurs when comparing fragments of genomic DNA sequence to each other, or to larger chromosomal sequences
• We do not penalize overhanging ends
• We want a match to start on the top or left border of the matrix, and finish on the right or bottom border
Example
The Algorithm
• F(i, 0)=0 for i=1,…,n and F(0, j)=0 for j=1,…,m
• The recurrence for F(i, j) is the same as the one in global alignment
• We set Fmax to be the maximum value on the right border (i,m), i=1,…,n, and the bottom border (n,j), j=1,…,m
• The traceback starts from the maximum point and continues until the top or left edge is reached
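The overlap variant differs from global alignment only in its boundary values and in where the final score is read off, as this score-only sketch shows. It uses the HEAGAWGHEE / PAWHEAE example with the BLOSUM50 scores from the example table and an assumed gap cost of 8; the name `overlap_score` is illustrative.

```python
X, Y = "HEAGAWGHEE", "PAWHEAE"
# BLOSUM50 scores copied from the example table (rows: residues of Y,
# columns: residues of X in order).
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}
S = {(a, b): v for b, row in ROWS.items() for a, v in zip(X, row)}

def overlap_score(x, y, s, d):
    """Best overlap-match score: overhanging ends are not penalized."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # F(i, 0) = F(0, j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s[x[i - 1], y[j - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    end_of_y = max(F[i][m] for i in range(1, n + 1))   # all of y consumed
    end_of_x = max(F[n][j] for j in range(1, m + 1))   # all of x consumed
    return max(end_of_y, end_of_x)

fmax = overlap_score(X, Y, S, 8)
print(fmax)   # 25
```

Unlike local alignment, there is no 0 option inside the matrix, so an alignment must run from one border to another.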
Example
More Complex Models
• Previously we only considered the gap score to be a simple multiple of the length (γ(g)=-gd)
• This type of scoring is not ideal for biological sequences: it penalizes additional gap steps as much as the first
• We should penalize additional gap steps less than the first
• When gaps occur, they are often longer than one residue
More Complex Models
• We may use a general function for γ(g)
• This will require intensive computation
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(k, j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i, k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases}$$
More Complex Models
• This procedure now requires O(n^3) operations to align two sequences of length n
• In each cell (i,j) we have to look at i+j+1 potential precursors, not just three as previously
$$\sum_{i=0}^{n} \sum_{j=0}^{m} (i + j + 1) = \frac{(n+1)(m+1)(n+m+2)}{2} = O(n^3) \text{ for } m \approx n$$
More Complex Models
• In many cases this is a prohibitively costly increase in computational time
• Under some conditions the computational time can be reduced to O(n^2), although the constant of proportionality is higher in these cases
• In each cell we have to look at 2k+1 potential precursors
$$\sum_{i=0}^{n} \sum_{j=0}^{m} (2k + 1) \approx nm(2k+1) = O(n^2) \text{ for fixed } k$$
Alignment with Affine Gap Scores
• γ(g)= -d-(g-1)e, where d is the gap-open penalty and e is the gap-extension penalty
• The running time is O(n^2)
• We now have to keep track of multiple values for each pair of residue coefficients (i,j) in place of the single value F(i,j)
• Let M(i,j) be the best score up to (i,j) given that xi is aligned to yj (left case), Ix(i,j) be the best score given that xi is aligned to a gap (in an insertion w.r.t. y, central case), and Iy(i,j) be the best score given that yj is aligned to a gap (in an insertion w.r.t. x, right case).
Assumption
• We assume that a deletion will not be followed directly by an insertion
• This will be true for the optimal path if (-d-e) is less than the lowest mismatch score
The Recurrence Relations
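The slide's own figure is not reproduced in this transcript. As a sketch, the standard three-state recurrences that match the definitions of M, Ix and Iy, the affine cost γ(g) = -d-(g-1)e, and the assumption that a deletion is not followed directly by an insertion can be written as:

```latex
\begin{align*}
M(i,j)   &= \max \begin{cases} M(i-1,j-1) + s(x_i, y_j), \\ I_x(i-1,j-1) + s(x_i, y_j), \\ I_y(i-1,j-1) + s(x_i, y_j); \end{cases} \\
I_x(i,j) &= \max \begin{cases} M(i-1,j) - d, \\ I_x(i-1,j) - e; \end{cases} \\
I_y(i,j) &= \max \begin{cases} M(i,j-1) - d, \\ I_y(i,j-1) - e. \end{cases}
\end{align*}
```

Opening a gap from state M costs d, and staying in a gap state costs e per additional residue, so a gap of length g accumulates exactly d + (g-1)e.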
The Diagram of the Relationships
The Interpretation
• The transitions each carry a score increment, and the states each specify a ∆(i,j) pair, which is used to determine the change in indices i and j when that state is entered
• The new value for a state variable at (i,j) is the maximum of the scores corresponding to the transitions coming into the state
The Interpretation
• Each transition score is given by the value of the source state at the offsets specified by the ∆(i,j) pair of the target state, plus the specified score increment
• This type of description corresponds to a finite state automaton (FSA) in computer science
Alignment
An alignment corresponds to a path through
the states, with symbols from the underlying
pair of sequences being transferred to the
alignment according to the ∆(i,j) values in
the states.
The Algorithm
• Transitions to state M indicate letter-to-letter correspondences, so they are labeled with s(xi, yj) corresponding to the substitution score for replacing xi with yj
• We label every transition from M to a gap state (Ix or Iy) with the gap initiation penalty –d
• We label each transition from a gap state to itself with the gap extension penalty –e
Initialization
• M(0,0) = 0
• Ix(i,0) = -d - (i-1)e
• Iy(0,j) = -d - (j-1)e
• The optimal alignment score is given by max[M(n, m), Ix(n, m), Iy(n, m)]
• Every path through this model corresponds to an alignment
• If we sum every score on the transitions of a path, we will have the same score as a global alignment dynamic programming problem with affine gap penalties
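A score-only sketch of the three-state affine gap alignment follows. The boundary values are taken from the affine cost γ(g) = -d-(g-1)e, undefined cells are set to minus infinity, and the DNA scoring function and parameter values (match 2, mismatch -3, d=4, e=1) are assumptions chosen so that -d-e is below the lowest mismatch score. The names are illustrative.

```python
NEG_INF = float("-inf")

def dna_score(a, b):
    """Assumed toy substitution score: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3

def affine_global_score(x, y, s, d, e):
    """Best global alignment score with gap cost gamma(g) = -d - (g-1)*e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # x_i aligned to y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # y_j aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) \
                + s(x[i - 1], y[j - 1])
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)   # open or extend
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)   # open or extend
    return max(M[n][m], Ix[n][m], Iy[n][m])

result = affine_global_score("AATT", "AACCTT", dna_score, d=4, e=1)
print(result)   # 3: four matches (8) minus one gap of length 2 (4 + 1)
```

A traceback would need one pointer per state rather than one per cell, since the best predecessor depends on which of the three states the path is in.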
Example
• x3 has been matched to y5 and we are in state M
• Now x4 has to be matched and it is best to match it to a gap on y
• The current state will change from M to Ix and the penalty of d (gap open) must be paid
• If x5 is also assigned to a gap on y, the penalty of e (gap extend) must be paid
• If then x6 is assigned to y6, the state will change back to M and the cost of s(x6,y6) will be added
Example
Alignment with Affine Gap Scores
• It is in fact frequent practice to implement an affine gap cost algorithm using only two states, M and I.
Alignment with Affine Gap Scores
• This is only guaranteed to provide the correct result if the lowest mismatch score is >= -2e
• For those interested in pursuing the subject, the simpler state-based automata are called Moore machines, and the transition-emitting systems are called Mealy machines
More Complex FSA Models
• Four-state FSA with two match states
• There may be high fidelity regions of alignment without gaps, corresponding to match state A
More Complex FSA Models
• Separated by lower fidelity regions with gaps, corresponding to match state B and gap states Ix and Iy
• Given an alignment path, there is also an implicit attachment of labels to the symbols in the original sequences, indicating which state was used to match them
Exercise 2.10
• Calculate the score of the example alignment in Figure 2.10, with d=12, e=2
Heuristic Alignment Algorithms
• Using dynamic programming to compute similarity between two sequences will cost O(n*m)
• With this cost, aligning a sequence against a database containing millions of sequences is not feasible
Heuristic Alignment Algorithms
• The current protein database contains of the order of 10^8 residues, so for a query sequence of length 10^3, approximately 10^11 matrix cells must be evaluated to search the complete database
• At ten million matrix cells a second, which is reasonable for a single workstation at the time this is being written, this would take 10^4 seconds, or around 3 hours
• If we want to search with many different sequences, time rapidly becomes an important issue
Heuristic Alignment Algorithms
• The goal of these methods is to search as small a fraction as possible of the cells in the dynamic programming matrix, while still looking at all the high scoring alignments
• For the scoring matrices used to find distant matches, exact methods become intractable, and we must use heuristic approaches that sacrifice some sensitivity
Heuristic Alignment Algorithms
• Two of the best-known algorithms are FASTA and BLAST
• They work from the same basic idea, namely that most sequences in the database don’t match the query, so the algorithms apply some heuristic to exclude many of the unrelated sequences
• If these heuristics are used there will be no guarantee that we will find the optimal scoring alignment
General Approach
• Seeds: all good ungapped matches to subsequences of X of a given length (X is our query)
• For each sequence Y in the database
1. Search for seeds in Y
2. Extend alignment around seeds or partition the alignment problem
3. If a high scoring match was found in the last step, then use dynamic programming around a good match
FASTA: Fast Alignment
• Seeds: All subsequences of length ktup
• Typical values of ktup are
- Proteins: 1-2
- DNA: 4-6
• Store all seeds with their starting position in X in a hashtable
Steps
1. Using the hashtable we can now find all exact matches (hotspots) to seeds in Y. The running time will be linear in the length of Y: O(|Y|)
2. Chain hotspots into runs of hotspots on the same diagonal. This is done efficiently by sorting hotspots on (j-i). A run consists of one or more consecutive hotspots along a common diagonal
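The first two steps can be sketched with a dictionary standing in for the hashtable. This is a minimal illustration of the idea, not FASTA's actual implementation; the function names are hypothetical, and hotspots are grouped by diagonal rather than sorted, which gives the same runs.

```python
from collections import defaultdict

def kmer_index(x, ktup):
    """Map every length-ktup subsequence of x to its start positions."""
    index = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        index[x[i:i + ktup]].append(i)
    return index

def hotspots_by_diagonal(x, y, ktup):
    """Find exact ktup matches (i in x, j in y) and group them by diagonal j - i."""
    index = kmer_index(x, ktup)
    diagonals = defaultdict(list)
    for j in range(len(y) - ktup + 1):          # one pass over y
        for i in index.get(y[j:j + ktup], ()):
            diagonals[j - i].append((i, j))
    return dict(diagonals)

runs = hotspots_by_diagonal("ACGTAC", "TTACGT", ktup=3)
print(runs)   # {-2: [(3, 1)], 2: [(0, 2), (1, 3)]}
```

Two hotspots that share a diagonal, such as (0, 2) and (1, 3) here, can be joined without gaps, which is why diagonals are the natural unit for building runs.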
The Steps
(2-a) Using a function that takes a set of hotspots and their distance to each other, we can find out how good the runs are. Score(run) = δ(hotspot, distance)
(2-b) Pick 10 high scoring runs:
R1 = (α1, β1) …….
R10 = (α10, β10)
(2-c) If max{S(αi, βi)}, 1 ≤ i ≤ 10, is sufficiently high we continue
The Steps
(2-d) Now we can construct a weighted graph. If the longest path in this graph is sufficiently high we can continue to next step. The length of a path is the sum of the weights on the edges and the weights on the vertices of the path
The Steps
3. Perform banded dynamic programming around the high scoring run. The running time of this step is O( |X| + |Y| ), because the band width ‘c’ is limited by a constant
BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
Basic Local Alignment Search Tool
• The package provides programs for finding high scoring local alignment between a query sequence and a target database
• The idea is that true match alignments are very likely to contain somewhere within them a short stretch of identities, or very high scoring matches.
BLAST
• Look initially for such short stretches and use them as ‘seeds’, from which to extend out in search of a good longer alignment.
• By keeping the seed segments short, it is possible to pre-process the query sequence to make a table of all possible seeds with their corresponding start point.
BLAST
• Make a list of all ‘neighbourhood words’ of a fixed length
• Scan through the database; whenever a word in the set is found, start a ‘hit extension’ process to extend the possible match as an ungapped alignment in both directions, stopping at the maximum scoring extension
BLAST
• Finds only ungapped alignments
• Restricting to ungapped alignments misses only a small proportion of significant matches.
• Can find and report more than one high scoring match per sequence pair and can give significance values for combined scores.
BLAST
• Word: subsequence of length d
• Typical values of d are
- Proteins: 3
- DNA: 11
• Seeds in BLAST are longer than seeds in FASTA, but it isn’t necessary to find exact matches
• Seeds: All words w’ such that s (w, w’) ≥ t for some word w in X
The Steps
1. Find exact matches to seeds in Y
2. Extend each match to maximal extension and report matches with score > c
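The seed definition above (all words w' with s(w, w') ≥ t for some word w in X) can be sketched by brute-force enumeration. This is only an illustration of the definition, not how BLAST builds its word list in practice; the function names, the DNA alphabet, and the +1/-1 scoring are assumptions.

```python
from itertools import product

def neighbourhood(word, alphabet, s, t):
    """All words w' of the same length whose summed score against word is >= t."""
    return {"".join(c) for c in product(alphabet, repeat=len(word))
            if sum(s(a, b) for a, b in zip(word, c)) >= t}

def seeds(x, d, alphabet, s, t):
    """Map each neighbourhood word to the start positions in x that generate it."""
    table = {}
    for i in range(len(x) - d + 1):
        for w in neighbourhood(x[i:i + d], alphabet, s, t):
            table.setdefault(w, []).append(i)
    return table

def match(a, b):
    """Assumed toy score: +1 for identical bases, -1 otherwise."""
    return 1 if a == b else -1

words = neighbourhood("ACG", "ACGT", match, t=1)
table = seeds("ACGT", 3, "ACGT", match, t=1)
print(len(words))       # 10: the word itself plus nine single-mismatch words
print(table["ACG"])     # [0]
```

Enumerating the whole alphabet to the power d is only workable for short words, which fits the small typical values of d listed above.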
Linear Space Alignments
• We want to reduce the memory used by the alignment computation
• All the algorithms described so far calculate score matrices such as F(i,j), which have overall size nm
• For two protein sequences, of typical length a few hundred residues, this is well within the capacity of modern desktop computers
• If one or both of the sequences is a DNA sequence tens or hundreds of thousands of bases long, the required memory for the full matrix can exceed a machine’s physical capacity
Linear Space Alignments
• Building the full matrix takes O(n^2) space, which can be too much for large n
• If we are only interested in the max score, there is no need to store more than the current row j and the previous row j-1
• Every value (i,j) of the current row is calculated from its predecessor (i-1,j) and from values of the previous row
• By doing so, only linear space is required, but we lose the ability to trace back and so to recover the max scoring alignment
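A score-only version of global alignment that keeps just two rows might look like the sketch below. The function names and the toy scoring (match 2, mismatch -3, linear gap cost 4) are assumptions for illustration.

```python
def dna_score(a, b):
    """Assumed toy substitution score: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3

def global_score_linear_space(x, y, s, d):
    """Global alignment score using O(len(y)) memory; no traceback is possible."""
    prev = [-j * d for j in range(len(y) + 1)]        # row for the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [-i * d]                               # first cell: x prefix vs gaps
        for j, b in enumerate(y, start=1):
            curr.append(max(prev[j - 1] + s(a, b),    # a aligned to b
                            prev[j] - d,              # a aligned to a gap
                            curr[j - 1] - d))         # b aligned to a gap
        prev = curr                                   # the older row is discarded
    return prev[-1]

print(global_score_linear_space("AATT", "AACCTT", dna_score, 4))   # 0
```

Only the score survives here; recovering the alignment itself in linear space needs the divide and conquer approach described next.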
Divide and Conquer Algorithm
• We repeatedly halve the problem
• The optimal alignment for the whole matrix will be the concatenation of the optimal alignments of the two submatrices
• These can again be calculated recursively, either until the length is zero or until the length is so small that it can be calculated with the standard O(n^2) algorithm
The Approach
• We want to divide the matrix at the entry (i,j)
• We set v=⌊m/2⌋ and now we have to find the column u so that the entry (u,v) is part of the optimal alignment for the whole matrix
• For each cell (i,j) with j>v, we store not only F(i,j) but also the column where the optimal path to (i,j) crossed row v
• Thus we can obtain the column u and so the cell (u,v) once we reach the entry (n,m)
Example
• F(2,2) has been calculated and at the same time, the parent is set to (1,1)
• If the score of F(2,3) is calculated from the score of F(2,2), the parent of (2,3) will be set to the parent of (2,2) i.e. (1,1)