Heuristic alignment algorithms; Cost matrices
2.5 – 2.9
Thomas van Dijk
Content
Dynamic programming: improve to linear space
Heuristics for database searches: BLAST, FASTA
Statistical significance: what the scores mean; score parameters (PAM, BLOSUM)
Dynamic programming
Improve to linear space
Dynamic programming
Needleman-Wunsch (NW) and Smith-Waterman (SW) run in O( nm ) time and use O( nm ) space.
For proteins, this is okay. But DNA strings are huge!
Improve this to O( n+m ) space while still using O( nm ) time.
Basic idea
Full DP requires O( nm ) space
Basic idea for linear space: cells don’t directly depend on cells more than one column to the left.
So keep only two columns; forget the rest.
No back-pointers!
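The two-column idea can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions. It returns only the optimal global score, because without back-pointers the alignment itself is not recoverable from this pass.

```python
def nw_score_linear_space(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score keeping only two DP columns (no back-pointers).

    Scoring values are assumed toy parameters, not taken from the slides.
    """
    # Column j = 0: x[:i] aligned against the empty prefix of y.
    prev = [i * gap for i in range(len(x) + 1)]
    for j in range(1, len(y) + 1):
        cur = [j * gap] + [0] * len(x)
        for i in range(1, len(x) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[i] = max(prev[i - 1] + s,   # diagonal: align x[i-1] with y[j-1]
                         prev[i] + gap,     # from the column to the left: gap in x
                         cur[i - 1] + gap)  # from the cell above: gap in y
        prev = cur  # forget everything older than one column
    return prev[len(x)]
```

Only `prev` and `cur` are alive at any time, so the memory use is proportional to the length of `x`, while every one of the nm cells is still computed once.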
How to find the alignment?
If we happen to know a cell on the optimal alignment…
Divide and conquer
We could then repeat this trick!
But how to find such a point?
Determine where the optimal alignment crossed a certain column: at every cell, remember at which row it crossed
the column.
Modified DP
Always only two columns at a time. Clearly O( n+m ) space.
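A possible sketch of the whole scheme: a two-column DP that also remembers, for every cell to the right of a chosen column, the row at which the best path to that cell crossed the column, plus the divide-and-conquer recursion around it. All names (`crossing_row`, `linear_space_align`, `full_align`), the toy scoring function, and the linear gap penalty are assumptions made for illustration. Tiny sub-problems fall back to ordinary DP with traceback, which is still linear space because one of the strings has at most one character.

```python
GAP = -2  # assumed linear gap penalty


def score(a, b):
    """Assumed toy substitution score: +1 match, -1 mismatch."""
    return 1 if a == b else -1


def alignment_score(ax, ay):
    """Score of a finished alignment given as two gapped strings."""
    return sum(GAP if "-" in (a, b) else score(a, b) for a, b in zip(ax, ay))


def full_align(x, y):
    """Ordinary Needleman-Wunsch with traceback; O(nm) space."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + GAP, F[i][j - 1] + GAP)
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + GAP:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


def crossing_row(x, y, u):
    """Row at which an optimal path to the last cell crosses column u."""
    n = len(x)
    prev = [i * GAP for i in range(n + 1)]  # scores of column 0
    prev_c = list(range(n + 1))             # crossing rows; meaningful from column u on
    for j in range(1, len(y) + 1):
        cur, cur_c = [j * GAP] + [0] * n, [0] * (n + 1)
        for i in range(1, n + 1):
            best, c = prev[i - 1] + score(x[i - 1], y[j - 1]), prev_c[i - 1]
            if prev[i] + GAP > best:
                best, c = prev[i] + GAP, prev_c[i]
            if cur[i - 1] + GAP > best:
                best, c = cur[i - 1] + GAP, cur_c[i - 1]
            cur[i], cur_c[i] = best, c
        if j <= u:
            cur_c = list(range(n + 1))  # in column u, each cell crosses at its own row
        prev, prev_c = cur, cur_c
    return prev_c[n]


def linear_space_align(x, y):
    """Divide and conquer: split at the middle column, recurse on both parts."""
    if len(x) <= 1 or len(y) <= 1:
        return full_align(x, y)
    u = len(y) // 2
    v = crossing_row(x, y, u)
    left = linear_space_align(x[:v], y[:u])
    right = linear_space_align(x[v:], y[u:])
    return left[0] + right[0], left[1] + right[1]
```

The recovered alignment may differ from the one full DP would trace back when several alignments tie, but its score is the same.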
But what about the time?
Space analysis
Using linear space DP at every step of a divide and conquer scheme.
What are we doing?
Time analysis
We do more work now, but how much? Look at the case with two identical strings.
n² + ½ n² + ¼ n² + … < 2n²
Along the same lines, the algorithm in general is still O( nm ).
And actually only about twice as much work!
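The halving argument can be written out as a geometric series. A sketch, assuming the split is always made at the middle column, with v denoting the row at which the path crosses it (a symbol introduced here for illustration):

```latex
% Work left after one split at the middle column, whatever the crossing row v:
v \cdot \tfrac{m}{2} + (n - v) \cdot \tfrac{m}{2} = \tfrac{1}{2}\, nm
% Summing over all levels of the recursion:
nm + \tfrac{1}{2}\, nm + \tfrac{1}{4}\, nm + \dots
  \;\le\; nm \sum_{k=0}^{\infty} 2^{-k} \;=\; 2\, nm
```

The two sub-problems together always cover half of the cells of their parent, independent of where the path crosses, which is why the bound does not depend on the strings being identical.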
Questions?
Heuristics for database search
BLAST, FASTA
Searching a database
Database of strings. Query string.
Select best matching string(s) from the database.
Algorithms we already know
Until now, the algorithms were exact and correct (For the model!)
But for this purpose, too slow:
Say a database of 100M characters and a query of 1000 characters, so 10^11 DP cells.
Say 10M cells per second.
Leads to 10^4 seconds, i.e. ~3 hours search time.
Getting faster
Full DP is too expensive
Space issue: fixed. With guaranteed same result
Now the time issue
BLAST and FASTA
Heuristics to prevent doing full DP
- Not guaranteed same result.
- But tend to work well.
Idea: first spend some time to analyze the strings. Don’t calculate all DP cells; only some, based on the analysis.
Basic idea: in a good alignment, it is likely that several short parts match exactly.
A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A
k-tuples
Decompose strings into k-tuples with corresponding offset.
E.g. with k=3, “A C C A B B” becomes
0: A C C
1: C C A
2: C A B
3: A B B
Do this for the database and the query
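The decomposition can be sketched in a few lines. The function name `k_tuples` is a hypothetical helper chosen for this example.

```python
def k_tuples(s, k=3):
    """Return all (offset, k-tuple) pairs of the string s, in order of offset."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]


for offset, tup in k_tuples("ACCABB"):
    print(f"{offset}: {tup}")
# prints:
# 0: ACC
# 1: CCA
# 2: CAB
# 3: ABB
```

A string of length n yields n - k + 1 tuples, so the decomposed database is roughly as large as the original one times k.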
Example strings
A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A
3-tuple join
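First string:
0: ACC 1: CCA 2: CAB 3: ABB 4: BBD 5: BDB 6: DBC 7: BCD 8: CDC 9: DCB 10: CBB 11: BBC 12: BCB 13: CBA
Second string:
0: ABB 1: BBA 2: BAD 3: ADA 4: DAC 5: ACC 6: CCB 7: CBB 8: BBC 9: BCC 10: CCD 11: CDC 12: DCD 13: CDA
Matching tuples, with diagonal = offset in first string minus offset in second string:
ABB: 3, ACC: -5, CBB: 3, BBC: 3, CDC: -3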
Matches / hits
Lots on the same diagonal: might be a good alignment.
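Counting hits per diagonal can be sketched as a join on the tuple value. This is an illustrative sketch, not BLAST or FASTA themselves: the helper names are hypothetical, and the diagonal is taken as query offset minus database offset, with the first example string assumed to be the query.

```python
from collections import Counter, defaultdict


def k_tuples(s, k=3):
    """Return all (offset, k-tuple) pairs of the string s."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]


def diagonal_hits(query, db_string, k=3):
    """Count exact k-tuple matches per diagonal (query offset - db offset)."""
    # Index the database string: tuple -> list of offsets where it occurs.
    index = defaultdict(list)
    for d_off, tup in k_tuples(db_string, k):
        index[tup].append(d_off)
    # Join the query tuples against the index.
    hits = Counter()
    for q_off, tup in k_tuples(query, k):
        for d_off in index.get(tup, []):
            hits[q_off - d_off] += 1
    return hits


hits = diagonal_hits("ACCABBDBCDCBBCBA", "ABBADACCBBCCDCDA")
print(hits.most_common())
# → [(3, 3), (-5, 1), (-3, 1)]
```

Diagonal 3 collects three hits (ABB, CBB, BBC), so it is the candidate region for a closer look; the two single hits would be ignored under a "more than one match" rule.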
[Figure: plot of hits; axes: offset in query, offset in db string]
Do e.g. “banded DP” around diagonals with many matches
Don’t do full DP
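One way to picture banded DP is a local alignment that only fills cells within a fixed distance of a promising diagonal. The sketch below is an assumption-laden illustration, not the method used by BLAST or FASTA: the function name, the band parameters `diag` and `width`, and the toy scoring values are all hypothetical. The diagonal is again query offset minus database offset.

```python
def banded_local_score(q, d, diag, width, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(i - j) - diag| <= width.

    i indexes q and j indexes d (both 1-based inside the DP).
    Cells outside the band are treated as unreachable.
    """
    neg_inf = float("-inf")
    best = 0
    prev = {}  # previous row: column j -> score, band cells only
    for i in range(1, len(q) + 1):
        cur = {}
        lo = max(1, i - diag - width)
        hi = min(len(d), i - diag + width)
        for j in range(lo, hi + 1):
            s = match if q[i - 1] == d[j - 1] else mismatch
            h = max(0,
                    prev.get(j - 1, 0) + s,          # same diagonal; 0 at the border
                    prev.get(j, neg_inf) + gap,      # gap in d, only if inside band
                    cur.get(j - 1, neg_inf) + gap)   # gap in q, only if inside band
            cur[j] = h
            best = max(best, h)
        prev = cur
    return best


print(banded_local_score("ACCABBDBCDCBBCBA", "ABBADACCBBCCDCDA", diag=3, width=0))
# → 4
```

With `width=0` only the diagonal itself is scored, so the work is linear in the string length; widening the band allows gaps at a cost of roughly (2·width + 1) cells per row instead of a full row.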
[Figure: axes: offset in query, offset in db string]
Some options
If no diagonal with multiple matches, don’t DP at all.
Don’t just allow exact ktup matches, but generate ‘neighborhood’ for query tuples.
…
Personal experience
“Database architecture” practical assignment
MonetDB: main memory DBMS
SWISS-PROT decomposed into 3-tuples
150k strings
150M 3-tuples
Find database strings with more than one match on the same diagonal.
Personal experience
43 char query string
~500k matches (in ~122k strings)
~32k diagonals with more than one match in ~25k strings
With some implementation effort: ~1s (Kudos for Monet here!)
Personal experience
From 150k strings to 15k ‘probable’ strings in 1 second.
This discards 90% of the database for almost no work.
And even gives extra information to speed up subsequent calculations.
… but might discard otherwise good matches.
Personal experience
Tiny query: 6 char query, 122 diagonals in 119 strings, in no time at all.
An actual protein from the database: 250 char query, ~285k diagonals in ~99k strings, in about 5 seconds.
BLAST/FASTA conclusion
- Not guaranteed same result.
- But tend to work well.
Questions?
Statistical significance
What do the scores mean?
Score parameters (PAM, BLOSUM)
What do the scores mean?
We are calculating ‘optimal’ scores, but what do they mean?
Used log-odds to get an additive scoring scheme.
Biologically meaningful versus just the best alignment between random strings.
1. Bayesian approach
2. Classical approach
Bayesian approach
Interested in: Probability of a match, given the strings P( M | x,y )
Already know: Probability of strings given the models, i.e.
P( x,y | M ) and P( x,y | R )
So … Bayes rule.
Bayesian approach
Bayes rule gives us:
P( M | x,y ) = P( x,y | M ) P( M ) / P( x,y )
…rewrite… …rewrite… …rewrite…
Bayesian approach
P( M | x,y ) = σ( S’ )
S’ = log( P(x,y|M)/P(x,y|R) ) + log( P(M)/P(R) )
Our score!
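The rewrite steps behind this result can be sketched as follows, assuming that the match model M and the random model R are the only two hypotheses, so that P( R ) = 1 - P( M ), and that σ is the logistic function with natural logarithms in S’:

```latex
P(M \mid x,y)
  = \frac{P(x,y \mid M)\,P(M)}{P(x,y \mid M)\,P(M) + P(x,y \mid R)\,P(R)}
  = \frac{1}{1 + \dfrac{P(x,y \mid R)\,P(R)}{P(x,y \mid M)\,P(M)}}
  = \frac{1}{1 + e^{-S'}}
  = \sigma(S'),
\qquad
S' = \log\frac{P(x,y \mid M)}{P(x,y \mid R)} + \log\frac{P(M)}{P(R)}
```

The first term of S’ is the log-odds alignment score; the second term only depends on the priors.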
Take care! Requires that substitution matrix
contains probabilities.
Mind the prior probabilities: when matching against a database, subtract a log(N) term.
Alignment score is for the optimal alignment between the strings; ignores possible other good alignments.
Classical approach
Call the maximum score among N random strings M_N.
P( M_N < x ) means “Probability that the best match from a search of a large number N of unrelated sequences has score lower than x.”
The distribution of M_N is an Extreme Value Distribution.
Consider x = our score. Then if this probability is very large: likely that our match was not just random.
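As a rough numerical illustration, an extreme value distribution of Gumbel type has the form exp(-K·N·e^(-λx)). The sketch below assumes that form; the function name and the values of `K` and `lam` are hypothetical placeholders, since real values must be fitted for the scoring scheme in use.

```python
import math


def prob_max_below(x, n, K=0.1, lam=0.3):
    """P(M_N < x) under an assumed Gumbel-type extreme value distribution.

    K and lam are placeholder parameters, not fitted values.
    """
    return math.exp(-K * n * math.exp(-lam * x))


# The same score becomes less convincing as the database grows.
for n in (1_000, 1_000_000):
    print(n, round(prob_max_below(40, n), 4))
```

The dependence on N is the classical counterpart of the log(N) prior correction in the Bayesian view: the more unrelated strings are searched, the higher the best random score is expected to be.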
Correcting for length
Additive score, so longer strings have higher scores!
Correcting for length
If a match with any string should be equally likely: correct for this bias by subtracting log(length) from the score.
Because the score is log-odds, this ‘is a division’ that normalizes the score.
Scoring parameters
How to get the values in cost matrices?
1. Just count frequencies?
2. PAM
3. BLOSUM
Just count frequencies?
Would be maximum likelihood estimate
- Need lots of confirmed alignments
- Different amounts of divergence
PAM
Grouped proteins by ‘family.’ Phylogenetic tree
PAM1 matrix: probabilities for 1 unit of time.
PAMn as (PAM1)^n
PAM250 often used.
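Raising PAM1 to the n-th power is plain repeated matrix multiplication. A minimal sketch on a made-up two-letter alphabet: the matrix values below are invented for illustration and are not real PAM1 entries, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_n(pam1, n):
    """Substitution probabilities after n time units: (PAM1)^n."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1 for a two-letter alphabet; each row sums to 1 (invented values).
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam250 = pam_n(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Because the same matrix is multiplied in n times, any estimation error in PAM1 is carried through every step, which is the inaccuracy the BLOSUM slide refers to.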
BLOSUM
Long-term PAM matrices are inaccurate:
Inaccuracies in PAM1 multiply!
There are actually differences between short-term and long-term changes.
Different BLOSUM matrices are specifically determined for different levels of divergence. Solves both problems.
Gap penalties
No proper time-dependent model
But it seems reasonable that:
expected number of gaps: linear in time
length of gaps: constant distribution
Questions?
What have we seen?
Linear space DP
Heuristics: BLAST, FASTA
What the scores mean
Available substitution matrices: PAM, BLOSUM
Last chance for questions…