Download - Heuristic alignment algorithms; Cost matrices

Heuristic alignment algorithms;Cost matrices

2.5 – 2.9

Thomas van Dijk

Content Dynamic programming

Improve to linear space

Heuristics for database searches BLAST, FASTA

Statistical significance What the scores mean Score parameters (PAM, BLOSUM)

Dynamic programming

Improve to linear space

Dynamic programming

NW and SW run in O( nm ) time And use O( nm ) space

For proteins, this is okay. But DNA strings are huge!

Improve this to O( n+m ) space while still using O( nm ) time.

Basic idea

Full DP requires O( nm ) space

Basic idea for linear space Cells don’t directly depend on cells more

than one column to the left. So keep only two columns; forget the rest

No back-pointers!

How to find the alignment?

If we happen to know a cell on the optimal alignment…

Divide and conquer

We could then repeat this trick!

Divide and conquer

But how to find such a point?

Divide and conquer

Determine where the optimal alignment crossed a certain column: at every cell, remember at which row it crossed

the column.

Modified DP

Always only two columns at a time. Clearly O( n+m ) space.

But what about the time?

Space analysis

Using linear space DP at every step of a divide and conquer scheme.

What are we doing?

We do more work now, but how much? Look at case with two identical strings.

n2

Time analysis

n2


n2 + ½ n2

Time analysis

¼ n2

¼ n2


n2 + ½ n2 + ¼ n2

Time analysis

1/16 n2

1/16 n2

1/16 n2

1/16 n2


n2 + ½ n2 + ¼ n2 + … < 2n2

Time analysis

et cetera…


n2 + ½ n2 + ¼ n2 + … < 2n2

Along the same lines,the algorithm in generalis still O( nm ).

And actually only abouttwice as much work!

Time analysis

et cetera…

Questions?

Heuristics for database search

BLAST, FASTA

Searching a database

Database of strings Query string

Select best matching string(s) from the database.

Algorithms we already know

Until now, the algorithms were exact and correct (For the model!)

But for this purpose, too slow: Say 100M strings with 1000 chars each Say 10M cells per second Leads to ~3 hours search time

Getting faster

Full DP is too expensive

Space issue: fixed. With guaranteed same result

Now the time issue

BLAST and FASTA

Heuristics to prevent doing full DP - Not guaranteed same result. - But tend to work well.

Idea: First spend some time to analyze strings Don’t calculate all DP cells; Only some, based on analysis.

Basic idea In a good alignment, it is likely that several

short parts match exactly.

A C C A B B D B C D C B B C B A A B B A D A C C B B C C D C D A


A C C A B B D B C D C B B C … A B B A D A C C B B C C D C D A

k-tuples

Decompose strings into k-tuples with corresponding offset.

E.g. with k=3, “A C C A B B” becomes

0: A C C 1: C C A 2: C A B 3: A B B

Do this for the database and the query

Example strings



A C C A B B D B C D C B B C … A B B A D A C C B B C C D C D A

3-tuple join

0: ACC 1: CCA 2: CAB 3: ABB 4: BBD 5: BDB 6: DBC 7: BCD 8: CDC 9: DCB10: CBB11: BBC12: BCB13: CBA

0: ABB 1: BBA 2: BAD 3: ADA 4: DAC 5: ACC 6: CCB 7: CBB 8: BBC 9: BCC10: CCD11: CDC12: DCD13: CDA

3

-5

3

-3

Matches / hits

Lots on the same diagonal: might be a good alignment.

Offset in query

Off

set

in d

b s

trin

g

Do e.g. “banded DP” around diagonals with many matches

Don’t do full DP

Offset in query

Off

set

in d

b s

trin

g

Some options

If no diagonal with multiple matches, don’t DP at all.

Don’t just allow exact ktup matches, but generate ‘neighborhood’ for query tuples.

…

Personal experience “Database architecture”

practical assignment

MonetDB: main memory DBMS SWISS-PROT decomposed

into 3-tuples 150k strings 150M 3-tuples

Find database strings with more than one match on the same diagonal.

Personal experience

43 char query string ~500k matches (in ~122k strings) ~32k diagonals with more than one

match in ~25k strings

With some implementation effort: ~1s(Kudos for Monet here!)

Personal experience

From 150k strings to 15k ‘probable’ strings in 1 second.

This discards 90% percent of database for almost no work.

And even gives extra information to speed up subsequent calculations.

… but might discard otherwise good matches.

Personal experience

Tiny query6 char query 122 diagonals in 119 stringsin no time at all

An actual protein from the database:250 char query ~285k diagonals in ~99k stringsin about 5 seconds

BLAST/FASTA conclusion

- Not guaranteed same result. - But tend to work well.

Questions?

Statistical significance

What do the scores mean?

Score parameters (PAM, BLOSUM)

What do the scores mean? We are calculating ‘optimal’ scores, but

what do they mean? Used log-odds to get an additive scoring scheme

Biologically meaningful versusJust the best alignment between random strings

1. Bayesian approach2. Classical approach

Bayesian approach

Interested in: Probability of a match, given the strings P( M | x,y )

Already know: Probability of strings given the models, i.e.

P( x,y | M ) and P( x,y | R )

So … Bayes rule.

Bayesian approach

Bayes rule gives us:

P( x,y | M ) P( M )P( M | x,y ) = P( x,y )

…rewrite… …rewrite… …rewrite…

Bayesian approach

P( M | x,y ) = σ( S’ )

S’ = log( P(x,y|M)/P(x,y|R) ) + log( P(M)/P(R) )

Our score!

Take care! Requires that substitution matrix

contains probabilities.

Mind the prior probabilities:when matching against a database, subtract a log(N) term.

Alignment score is for the optimal alignment between the strings; ignores possible other good alignments.

Classical approach

Call the maximum score among N random strings: MN.

P( MN < x ) means “Probability that the best match from

a search of a large number N of unrelated sequences has score lower than x.”

is an Extreme Value Distribution Consider x = our score. Then if this

probability is very large: likely that our match was not just random.

Correcting for length

Additive score So longer strings have higher scores!

Correcting for length

If match with any string should be equally likely Correct for this bias by subtracting

log(length) from the score Because score is log-odds, this `is a

division’ that normalizes score

Scoring parameters

How to get the values in cost matrices?

1. Just count frequencies?2. PAM3. BLOSUM

Just count frequencies?

Would be maximum likelihood estimate

- Need lots of confirmed alignments

- Different amounts of divergence

PAM

Grouped proteins by ‘family.’ Phylogenetic tree

PAM1 matrix probabilities for 1 unit of time.

PAMn as (PAM1)n

PAM250 often used.

BLOSUM

Long-term PAM are inaccurate: Inaccuracies in PAM1 multiply! Actually differences between short-term

and long-term changes.

Different BLOSUM matrices are specifically determined for different levels of divergence Solves both problems.

Gap penalties

No proper time-dependent model

But seems reasonable that: expected number of gaps linear in time length of gaps constant distribution

Questions?

What have we seen?

Linear space DP

Heuristics:BLAST, FASTA

What the scores mean

Available substitution matrices:PAM, BLOSUM

Last chance for questions…