PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational...


PatternHunter II: Highly Sensitive and Fast Homology Search

Bioinformatics and Computational Molecular Biology (Fall 2005): Representation

R94922059 林語君

Ming Li, Bin Ma, Derek Kisman, John Tromp


Overview
• Homology search
  • Local alignment algorithms
  • PH I
  • PH II
• Multiple Spaced Seeds
  • Computing hit probability
  • Finding a good seed set
• PH II
  • Design
  • Performance


Local alignment
• Smith-Waterman
  • Smith and Waterman, 1981; Waterman and Eggert, 1987
  • SSearch
• FastA
  • Wilbur and Lipman, 1983; Lipman and Pearson, 1985
• BLAST
  • Altschul et al., 1990; Altschul et al., 1997
  • Blast Family: BLASTN, BLASTP, etc.
  • MEGABLAST


PatternHunter
• Seed
  • Tradeoff: sensitivity <-> computation
• Consecutive k letters
  • k=11 in Blastn, k=28 in MegaBlast
• Nonconsecutive k letters
  • Spaced seed
  • A seed model with k required match positions has weight k
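
To make the consecutive versus spaced distinction concrete, here is a minimal sketch. It assumes a homologous region is written as a 0/1 string (1 = match, 0 = mismatch); the helper names `weight` and `hits` are illustrative and are not taken from PatternHunter itself. The spaced seed used is the first seed listed on the seed-set slide.

```python
def weight(seed: str) -> int:
    """Number of required match positions ('1's) in the seed."""
    return seed.count("1")


def hits(seed: str, region: str) -> bool:
    """True if the seed hits the 0/1 region at some offset.

    A hit at offset j means every '1' position of the seed lines up
    with a '1' (match) in the region; '0' positions are "don't care".
    """
    ones = [i for i, c in enumerate(seed) if c == "1"]
    for j in range(len(region) - len(seed) + 1):
        if all(region[j + i] == "1" for i in ones):
            return True
    return False


consecutive = "1" * 11            # Blastn-style seed: 11 consecutive letters
spaced = "111010010100110111"     # spaced seed with the same weight

# Toy region whose mismatches fall exactly on the spaced seed's
# don't-care positions: the spaced seed hits, the consecutive one does not.
region = spaced
print(weight(consecutive), weight(spaced))
print(hits(spaced, region), hits(consecutive, region))
```

Both seeds have the same weight, so they require the same number of matching letters per hit; only the placement of the required positions differs.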


PatternHunter II
• Genome Informatics 14 (2003)
• Extends the single optimized spaced seed of PH to multiple ones
• Speed: comparable to BLASTN (MEGABLAST)
• Sensitivity: approaching Smith-Waterman (SSearch)


Definition
• A homologous region, R
• A seed hits R
• A seed set A = {a1, …, ak} hits R (at least one ai hits R)
• Similarity
  • R has p = x% identities
• Sensitivity
  • Hit probability
  • Optimal (DP) = 1
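
As a rough illustration of these definitions, the sketch below estimates the sensitivity of a seed set by simulation. It assumes a region is a 0/1 sequence in which each position is independently a match with probability p; the names `seed_hits` and `estimate_sensitivity`, the trial count, and the toy seeds are illustrative choices, not part of PatternHunter II.

```python
import random


def seed_hits(seed: str, region: list[int]) -> bool:
    """True if all '1' positions of the seed align with matches somewhere."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    return any(
        all(region[j + i] for i in ones)
        for j in range(len(region) - len(seed) + 1)
    )


def estimate_sensitivity(seeds, length, p, trials=2000, rng=None):
    """Fraction of random regions hit by at least one seed in the set."""
    rng = rng or random.Random(0)      # fixed default seed for repeatability
    hit = 0
    for _ in range(trials):
        # 1 = identical position, 0 = mismatch, each with probability p / 1-p.
        region = [1 if rng.random() < p else 0 for _ in range(length)]
        if any(seed_hits(s, region) for s in seeds):
            hit += 1
    return hit / trials


# One seed versus a set of two seeds on regions of length 64, 70% identity.
print(estimate_sensitivity(["111010010100110111"], 64, 0.7))
print(estimate_sensitivity(["111010010100110111", "1110111010001111"], 64, 0.7))
```

Because a seed set hits whenever any one of its seeds hits, adding a seed to the set can only keep or raise the estimate on the same sample of regions.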


Computing Hit Probability
• NP-hard on multiple seeds
• DP on 1 seed
• Extend DP to multiple seeds


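Computing Hit Probability of Multiple Seeds
• Let A = {a1, …, ak} be a set of k seeds and R a random region of length L with similarity level p.
• f(i, b): the probability that A hits R[0:i], given that the binary string b is a suffix of R[0:i]
• Recurrence: $f(i, b) = (1 - p)\, f(i, 0b) + p\, f(i, 1b)$
• Answer: f(L, ε), where ε is the empty string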
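
The recurrence over f(i, b) can be sketched as a memoized recursion. This is a toy version under stated assumptions, not the PatternHunter II implementation: it keeps b as a plain string, does no trimming of b, and the names `hit_probability` and `status` are made up for the example. A seed that can no longer end at position i is treated as "dead"; when every seed is dead, the last position is dropped and the recursion continues on R[0:i-1].

```python
from functools import lru_cache


def hit_probability(seeds, length, p):
    """Probability that at least one seed hits a random 0/1 region.

    Each position is an independent match ('1') with probability p.
    f(i, b) is the probability that the seed set hits R[0:i], given
    that the binary string b is a suffix of R[0:i].
    """
    ones = [[k for k, c in enumerate(s) if c == "1"] for s in seeds]

    def status(idx, i, b):
        """'hit', 'open' (undecided) or 'dead' for one seed ending at i."""
        m = len(seeds[idx])
        if m > i:
            return "dead"                  # seed does not fit in R[0:i]
        offset = len(b) - m                # seed position k sits at b[offset + k]
        known = [k for k in ones[idx] if offset + k >= 0]
        if any(b[offset + k] == "0" for k in known):
            return "dead"                  # a required match is a mismatch
        return "hit" if len(known) == len(ones[idx]) else "open"

    @lru_cache(maxsize=None)
    def f(i, b):
        states = [status(idx, i, b) for idx in range(len(seeds))]
        if "hit" in states:
            return 1.0
        if "open" in states:
            # Condition on the next unknown position to the left of b.
            return (1 - p) * f(i, "0" + b) + p * f(i, "1" + b)
        if i == 0:
            return 0.0
        # No seed can end at position i: drop the last position.
        return f(i - 1, b[:-1])

    return f(length, "")


# Toy seed set on a short region; both sizes are arbitrary.
print(hit_probability(["1101", "1011"], 20, 0.7))
```

The string b only ever grows up to the length of the longest seed, which is what keeps the number of distinct states manageable for short seeds; for long seeds and regions this untrimmed version becomes slow.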


Finding a Good Seed Set
• NP-hard for both optimal seed and multiple seeds
• Greedy


Finding a Good Seed Set
• Compute the 1st seed a1 which maximizes the hit probability of {a1}
• Compute the 2nd seed a2 which maximizes the hit probability of {a1, a2}
• Repeat until
  • Reach the desired number of seeds
  • Reach the desired hit probability
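
The greedy steps above can be sketched as follows. To stay short and self-contained, this toy version scores seed sets by enumerating all 2^L regions, which is only feasible for very small L; the candidate pool (all seeds of a given weight and bounded length that start and end with 1) and every function name here are assumptions made for the example.

```python
from itertools import combinations, product


def candidates(weight, max_len):
    """All seeds of the given weight (>= 2) that start and end with '1'."""
    out = []
    for n in range(weight, max_len + 1):
        for inner in combinations(range(1, n - 1), weight - 2):
            out.append("".join(
                "1" if k in (0, n - 1) or k in inner else "0" for k in range(n)
            ))
    return out


def hit_probability(seeds, length, p):
    """Exact hit probability by brute-force enumeration (toy sizes only)."""
    total = 0.0
    for region in product((0, 1), repeat=length):
        if any(
            all(region[j + k] for k, c in enumerate(s) if c == "1")
            for s in seeds
            for j in range(length - len(s) + 1)
        ):
            matches = sum(region)
            total += p ** matches * (1 - p) ** (length - matches)
    return total


def greedy_seed_set(weight, max_len, length, p, num_seeds, target=1.0):
    """Add one seed at a time, each maximizing the combined hit probability."""
    pool = candidates(weight, max_len)
    chosen, best = [], 0.0
    # Stop at the desired number of seeds or the desired hit probability.
    while len(chosen) < num_seeds and best < target:
        remaining = [s for s in pool if s not in chosen]
        if not remaining:
            break
        nxt = max(remaining,
                  key=lambda s: hit_probability(chosen + [s], length, p))
        chosen.append(nxt)
        best = hit_probability(chosen, length, p)
    return chosen, best


seeds, prob = greedy_seed_set(weight=3, max_len=5, length=10, p=0.7, num_seeds=2)
print(seeds, prob)
```

Each round fixes the seeds already chosen and searches only over the next one, so the result is not guaranteed to be the best possible set of that size.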


Finding a Good Seed Set
• May not optimize the combined hit probability
• Good enough
  • Optimal: 16 weight-11 seeds, L=64, similarity=70%, first four seeds:
    {111010010100110111, 111100110010100001011, 110100001100010101111, 1110111010001111}
  • Greedy: 16 weight-12 seeds, L=64, similarity=70%, first four seeds:
    {111010010100110111, 1111000100010011010111, 1100110100101000110111, 1110100011110010001101}


Performance of the seeds
• From low to high
  • Solid: weight-11, k=1,2,4,8,16 seeds
  • Dashed: 1 seed, weight=10,9,8,7


Performance of the seeds
• Reducing the weight by 1
  • Increases the expected number of hits by a factor of 4
• Doubling the number of seeds
  • Increases the expected number of hits by a factor of 2
• Better: multiple seeds
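
The two factors can be checked with a back-of-the-envelope model. The sketch assumes unrelated DNA positions match independently with probability 1/4 (uniform base composition), so one seed of weight w produces a random hit at a given position pair with probability 4^-w; the function name and the position count are placeholders.

```python
def expected_random_hits(weight, num_seeds, num_positions, alphabet=4):
    """Expected number of chance hits under the uniform random model.

    Each of the `weight` required positions matches with probability
    1/alphabet, and every seed is tried at every position pair.
    """
    return num_seeds * num_positions * alphabet ** (-weight)


n = 10 ** 6   # hypothetical number of position pairs examined
base = expected_random_hits(weight=11, num_seeds=1, num_positions=n)

# Lowering the weight by 1 multiplies the expected hits by 4 ...
print(expected_random_hits(10, 1, n) / base)
# ... while doubling the number of seeds only multiplies them by 2.
print(expected_random_hits(11, 2, n) / base)
```

Under this model, adding seeds is the cheaper way to trade extra hits for sensitivity, which is the point of the last bullet.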


PH II Performance
• Compare with Blast (Blastn), Smith-Waterman (SSearch)
• Sensitivity of SSearch = 1
• Alignment score
  • BLAST methods (hash, DP)
  • match=1, mismatch=-1, gapopen=-5, gapextension=-1
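
For reference, the scoring scheme above can be applied to a gapped pairwise alignment as in this sketch. It assumes the two aligned sequences are given as equal-length strings with "-" for gaps, and that a gap of length n costs gapopen + n × gapextension; tools differ on that convention, so treat it as an assumption. The function name is illustrative.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned strings column by column with affine gap costs."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None                    # row that had a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != prev_gap:
                score += gap_open      # a new gap starts in this row
            score += gap_extend        # every gap column pays the extension
            prev_gap = row
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


# 5 matches, 1 mismatch and one gap of length 2.
print(alignment_score("ACGTTACG",
                      "ACG--ACC"))
```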


PH II Performance
• From low to high
  • Solid: PH II with 1, 2, 4, 8 seeds of weight 11
  • Dashed: Blastn, seed weight 11


Complexity Proof
• Finding optimal spaced seeds
  • NP-hard
• Finding one optimal seed
  • NP-hard
• Computing the hit probability of multiple seeds
  • NP-hard