The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs

The Basic Local Alignment Search Tool (BLAST)

Rapid database search tool (1990)

Idea:

(1) Search for high scoring segment pairs

The Basic Local Alignment Search Tool(BLAST)

A Y W T Y I V A L T – Q V R Q Y E A T

S I L C I V M I Y S R A - Q Y R Y W R Y

Most local alignments contain highly conserved sections without gaps

-> Search for high scoring segment pairs (HSPs), i.e. gap-free local alignments

Advantages: (a) speed

(b) a statistical theory for HSP scores exists.

The Basic Local Alignment Search Tool(BLAST)

Rapid data base search tool (1990)

Idea:

(1) Search for high scoring segment pairs

(2) Use word pairs as seeds

Pair-wise sequence alignment

T W L M H C A Q Y I C I M X H X C X T H Y

(1) Search for word pairs of length 3 with score > T and use them as seeds.

Pair-wise sequence alignment

A naïve search for word pairs would have a complexity of O(l1 * l2), where l1 and l2 are the lengths of the query and the database sequence

Solution: Preprocess the query sequence:

Compile a list of all words that have a score > T when aligned to a word in the query. Complexity: O(l1)

Organize the words in an efficient data structure (tree) for fast look-up
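The preprocessing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_score` is a hypothetical match/mismatch score standing in for a real substitution matrix, the threshold value is arbitrary, and a hash table (`dict`) is used for look-up instead of the tree mentioned on the slide.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    # Hypothetical scores; a real search would use a substitution matrix.
    return 5 if a == b else -2

def neighborhood(query, w=3, threshold=7):
    """Map every word scoring > threshold against a query word to the query offsets."""
    table = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for cand in product(AMINO_ACIDS, repeat=w):
            score = sum(toy_score(a, b) for a, b in zip(q_word, cand))
            if score > threshold:
                table.setdefault("".join(cand), []).append(i)
    return table

def find_seeds(table, subject, w=3):
    """Scan the database sequence once; return (query_pos, subject_pos) word hits."""
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in table.get(subject[j:j + w], [])]

table = neighborhood("AQY")
seeds = find_seeds(table, "CIMAQYTH")
```

With these toy scores a word qualifies if it differs from a query word in at most one position, so the table is built once per query and every database word is then resolved by a single look-up.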

The Basic Local Alignment Search Tool(BLAST)

Rapid data base search tool (1990)

Idea:

(1) Search for high scoring segment pairs

(2) Use word pairs as seeds

(3) Extend seed alignments until score drops below threshold value

Pair-wise sequence alignment

T W L M H C A Q Y I C I M X H X C X T H Y

Extend seeds in both directions until the score drops by X below the best score found so far.
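A minimal sketch of such a gap-free extension, assuming a hypothetical match/mismatch score (`toy_score`) and an arbitrary drop-off value `x_drop`; the function names are illustrative and not taken from the BLAST code.

```python
def toy_score(a, b):
    # Hypothetical scores; a real search would use a substitution matrix.
    return 5 if a == b else -4

def _extend(pairs, x_drop):
    """Walk along residue pairs; stop once the score falls x_drop below the best."""
    best = running = best_len = 0
    for n, (a, b) in enumerate(pairs, start=1):
        running += toy_score(a, b)
        if running > best:
            best, best_len = running, n
        elif best - running >= x_drop:
            break
    return best, best_len

def extend_seed(query, subject, i, j, w=3, x_drop=10):
    """Extend the word pair at query[i:i+w] / subject[j:j+w] to the left and right.

    Returns (score, query_start, query_end) of the resulting segment pair.
    """
    seed = sum(toy_score(a, b) for a, b in zip(query[i:i + w], subject[j:j + w]))
    right, r_len = _extend(zip(query[i + w:], subject[j + w:]), x_drop)
    left, l_len = _extend(zip(reversed(query[:i]), reversed(subject[:j])), x_drop)
    return seed + left + right, i - l_len, i + w + r_len

hsp = extend_seed("GGLMHCAQYTT", "KKLMHCAQYPP", 4, 4)
```

The extension keeps only the best-scoring prefix in each direction, so trailing mismatches that were explored before the drop-off are not part of the reported segment pair.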

Pair-wise sequence alignment

Algorithm not guaranteed to find best segment pair (heuristic)

But works well in practice!

The Basic Local Alignment Search Tool(BLAST)

New BLAST version (1997)

Two-hit strategy

Pair-wise sequence alignment

W L M H C A Q Y A R V I M X H X C X T H W A X R X v X

Search for two word pairs on the same diagonal; use a lower threshold T
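The two-hit condition can be illustrated as follows. This is a sketch under assumptions: hits are given as (query position, database position) pairs, the maximal distance `window` is a hypothetical parameter, and only neighbouring hits on a diagonal are compared.

```python
from collections import defaultdict

def two_hit_seeds(hits, w=3, window=40):
    """Return (diagonal, first_query_pos, second_query_pos) for hit pairs that
    lie on the same diagonal, do not overlap and are at most `window` apart."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if w <= second - first <= window:
                seeds.append((diagonal, first, second))
    return seeds

seeds = two_hit_seeds([(0, 5), (10, 15), (3, 20), (11, 16)])
```

Two hits share a diagonal exactly when the difference between database and query position is the same, which is why the hits are grouped by that difference first.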

The Basic Local Alignment Search Tool(BLAST)

New BLAST version (1997)

Two-hit strategy

Gapped BLAST

Position-Specific Iterative BLAST (PSI-BLAST)

Multiple sequence alignment

1aboA 1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE

1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE

1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG

1ihvA 1 .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG

1vie 1 .drvrkksga.........awqGQIVGWYctnlt.............peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....

1vie 28 YAVESeahpgsvQIYPVAALERIN......

Multiple sequence alignment

First question: how to score multiple alignments?

Possible scoring scheme:

Sum-of-pairs score

Multiple sequence alignment

Multiple alignment implies pairwise alignments:

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....

1vie 28 YAVESeahpgsvQIYPVAALERIN......

Multiple sequence alignment

Multiple alignment implies pairwise alignments:

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

Multiple sequence alignment

Multiple alignment implies pairwise alignments:

1aboA 36 WCEAQtkngqGWVPSNYITPVN

1ycsB 39 WWWARlndkeGYVPRNLLGLYP

Multiple sequence alignment

Multiple alignment implies pairwise alignments:

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

Multiple sequence alignment

Multiple alignment implies pairwise alignments:

Use the sum of the scores of these pairwise alignments.

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....

1vie 28 YAVESeahpgsvQIYPVAALERIN......
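As a small illustration, the sum-of-pairs score can be computed directly from the rows of a multiple alignment. The match, mismatch and gap values below are hypothetical placeholders (a real scheme would use a substitution matrix and possibly affine gap costs); columns in which both sequences of a pair have a gap are skipped, because they disappear from the implied pairwise alignment.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of an implied pairwise alignment (hypothetical values)."""
    if a == "-" and b == "-":
        return 0  # column vanishes in the pairwise projection
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum the scores of all pairwise alignments implied by a multiple alignment."""
    return sum(pair_score(a, b)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))

score = sum_of_pairs(["AC-T", "A-GT", "ACGT"])
```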

Multiple sequence alignment

Goal:

Find the multiple alignment with maximum score!

Multiple sequence alignment

Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment

Multidimensional search space instead of two-dimensional matrix!

Multiple sequence alignment

Complexity:

For three sequences of lengths l1, l2 and l3:

O( l1 * l2 * l3 )

For n sequences ( average length l ):

O( l^n )

Exponential complexity!
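A quick calculation makes the growth concrete. The helper names are illustrative; the sketch only counts the cells of the multidimensional dynamic-programming matrix and the predecessor cells each one depends on.

```python
from math import prod

def dp_cells(lengths):
    """Number of cells in the dynamic-programming matrix for the given sequence lengths."""
    return prod(length + 1 for length in lengths)

def predecessors_per_cell(n):
    """Each cell is reached from every non-empty combination of 'advance' steps."""
    return 2 ** n - 1

two = dp_cells([100, 100])
three = dp_cells([100, 100, 100])
ten = dp_cells([100] * 10)
```

For ten sequences of length 100 the matrix already has more than 10^20 cells, which is why the optimal solution is out of reach for realistic inputs.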

Multiple sequence alignment

Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment

Optimal solution not feasible:

-> Heuristics necessary

Multiple sequence alignment

(A) Carrillo and Lipman (MSA)

Find the sub-space of the dynamic-programming matrix in which the optimal path can be found

Multiple sequence alignment

(B) Stoye, Dress (DCA)

Divide search space into small sub-spaces

Calculate optimal alignment for sub-spaces

Concatenate sub-alignments

Multiple sequence alignment

Most popular way of constructing multiple alignments:

Progressive alignment.

Carry out a series of pair-wise alignments

Multiple sequence alignment

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Align most similar sequences

Multiple sequence alignment

WCEAQTKNGQGWVPSNYITPVN

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASFQPVAALERIN

WLNYNEERGDFPGTYVEYIGRKKISP

Multiple sequence alignment

WCEAQTKNGQGWVPSNYITPVN

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

Align sequence to alignment

Multiple sequence alignment

WCEAQTKNGQGWVPSNYITPVN-

WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP

Align alignment to alignment

Multiple sequence alignment

WCEAQTKNGQGWVPSNYITPVN--------

WW--RLNDKEGYVPRNLLGLYP--------

AVVIQDNSDIKVVP--KAKIIRD-------

YAVESEA---SVQ--PVAALERIN------

WLN-YNE---ERGDFPGTYVEYIGRKKISP

Rule: “once a gap - always a gap”
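The rule can be illustrated with a small merge step. The sketch assumes that some pairwise procedure has already decided how the columns of two existing alignments pair up; that decision is passed in as a hypothetical `columns` list, and the function only applies it. Gaps already present in either block are carried over unchanged, and new gaps are inserted into whole columns.

```python
def merge_blocks(block_a, block_b, columns):
    """Merge two alignments column by column.

    columns is a list of (i, j) pairs: i indexes a column of block_a,
    j a column of block_b; None means "insert a gap column on this side".
    """
    merged_a = [[] for _ in block_a]
    merged_b = [[] for _ in block_b]
    for i, j in columns:
        for out, row in zip(merged_a, block_a):
            out.append("-" if i is None else row[i])
        for out, row in zip(merged_b, block_b):
            out.append("-" if j is None else row[j])
    return ["".join(row) for row in merged_a + merged_b]

merged = merge_blocks(["AC-T", "ACGT"], ["AT"],
                      [(0, 0), (1, None), (2, None), (3, 1)])
```

Because existing columns are never reopened, an early misplaced gap cannot be corrected by later steps, which is the weakness the rule's name hints at.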

Multiple sequence alignment

Order of pair-wise profile alignments determined by phylogenetic tree based on pair-wise similarity values (guide tree)
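One simple way to derive a merge order from pairwise values is greedy average-linkage clustering, sketched below. This is an assumption made for illustration: the slides do not say which tree-building method is used, and the distances in the example are invented.

```python
def guide_tree_order(names, dist):
    """Return the sequence of cluster merges for average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]

    def average(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        _, i, j = min((average(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

dist = {frozenset("AB"): 1.0, frozenset("AC"): 4.0, frozenset("BC"): 5.0}
order = guide_tree_order(["A", "B", "C"], dist)
```

Each entry of the result names the two groups to align next, so the list can be read as the schedule of pair-wise profile alignments.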

Multiple sequence alignment

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Multiple sequence alignment

Problem: simple guide tree determines multiple alignment; multiple alignment determines phylogenetic analysis

Multiple sequence alignment

Implementations:

Clustal W, PileUp, MultAlin

Local multiple alignment

Find motifs contained in all sequences in data set

Problem:

Motifs are often present in only some sub-families

Neither local nor global methods are applicable

Alignment possible if order conserved

The DIALIGN approach

Combination of local and global methods.

Find local pair-wise similarities between input sequences (fragments)

Compose alignments from fragments

Ignore non-related parts of the sequences

The DIALIGN approach

atctaatagttaaactcccccgtgcttagagatccaaaccagtgcgtgtattactaacggttcaatcgcgcacatccgc

------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--

------atcTAATAGTTAaaccccctcgtGCTTag-------AGATCCaaaccagtgcgtgTATTACTAAc----------GGTTcaatcgcgcACATCCgc--

The DIALIGN approach

Score of an alignment:

Define score of fragment f:

l(f) = length of f

s(f) = sum of matches (similarity values)

P(f) = probability to find a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences.

Score w(f) = -ln P(f)
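A rough numerical sketch of this weight for DNA fragments. The way P(f) is approximated here is an assumption made for illustration only: matches are treated as independent with probability 1/4, the chance of at least s(f) matches in l(f) positions is a binomial tail, and that tail is multiplied by the number of possible fragment positions (the product of the two sequence lengths) and capped at 1.

```python
from math import comb, log

def tail_prob(l, s, p=0.25):
    """Probability of at least s matches among l independent positions."""
    return sum(comb(l, k) * p ** k * (1 - p) ** (l - k) for k in range(s, l + 1))

def fragment_weight(l, s, len1, len2, p=0.25):
    """w(f) = -ln P(f), with P(f) approximated as described above (assumption)."""
    prob = min(1.0, len1 * len2 * tail_prob(l, s, p))
    return -log(prob)

w_short = fragment_weight(8, 8, 100, 100)
w_long = fragment_weight(16, 16, 100, 100)
```

Under this approximation a fragment that is likely to occur by chance gets a weight near zero, while long fragments with many matches get large weights.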

The DIALIGN approach

Score of an alignment:

Define score of alignment as sum of scores w(f) of its fragments

No gap penalty is used!

Optimization problem for pair-wise alignment:

Find chain of fragments with maximal total score

The DIALIGN approach

------atctaatagttaaaccccctcgtgcttag-------agatccaaaccagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--

Fragment-chaining algorithm finds optimal chain of fragments.
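The chaining idea can be sketched with a simple quadratic dynamic program. This is not the DIALIGN implementation, only an illustration: a fragment is represented by a hypothetical tuple (start in sequence 1, start in sequence 2, length, weight), and one fragment may follow another only if it starts after the other has ended in both sequences.

```python
def best_chain(fragments):
    """Return (total_weight, chain) for the heaviest chain of compatible fragments."""
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))
    if not frags:
        return 0.0, []
    best = [f[3] for f in frags]   # best chain weight ending in each fragment
    prev = [None] * len(frags)     # predecessor in that best chain
    for j, (a1, b1, _, w1) in enumerate(frags):
        for i in range(j):
            a0, b0, l0, _ = frags[i]
            if a0 + l0 <= a1 and b0 + l0 <= b1 and best[i] + w1 > best[j]:
                best[j], prev[j] = best[i] + w1, i
    j = max(range(len(frags)), key=best.__getitem__)
    total, chain = best[j], []
    while j is not None:
        chain.append(frags[j])
        j = prev[j]
    return total, chain[::-1]

total, chain = best_chain([(0, 0, 5, 3.0), (3, 3, 4, 2.5),
                           (6, 7, 3, 2.0), (10, 2, 4, 4.0)])
```

Since the alignment score is just the sum of fragment weights with no gap penalty, the regions between chained fragments contribute nothing and can stay unaligned.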

The DIALIGN approach

Multiple fragment alignment

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

The DIALIGN approach

Multiple fragment alignment

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

The DIALIGN approach

Multiple fragment alignment

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

The DIALIGN approach

Multiple fragment alignment

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaac----------ggttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

The DIALIGN approach

Multiple fragment alignment

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Consistency: it is possible to introduce gaps such that all segment pairs are aligned.

The DIALIGN approach

Multiple fragment alignment

atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

Program evaluation

Use biologically verified alignments (known 3D structure of proteins)

Compare alignments produced by computer programs to “biologically correct” alignments.
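One common way to make such a comparison quantitative is to count how many residue pairs of the reference alignment are reproduced by the program's alignment. The sketch below assumes this pair-based measure; it is not necessarily the measure used in the evaluations cited on the following slides.

```python
def aligned_pairs(rows):
    """Set of residue pairs ((seq, residue_index), (seq, residue_index)) sharing a column."""
    pairs = set()
    counters = [0] * len(rows)
    for column in zip(*rows):
        present = []
        for k, ch in enumerate(column):
            if ch not in "-.":          # both characters denote gaps
                present.append((k, counters[k]))
                counters[k] += 1
        pairs.update((a, b) for x, a in enumerate(present) for b in present[x + 1:])
    return pairs

def pair_fraction(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test alignment."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)

fraction = pair_fraction(["ACT-", "ACGT"], ["AC-T", "ACGT"])
```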

Program evaluation

(1) First evaluation of multiple alignment programs (McClure, Vasi, Fitch, 1994)

4 protein families used: globin, kinase, protease, ribonuclease H

All globally related -> global programs performed best

Program evaluation

(2) BAliBASE (Thompson et al., 1999)

~ 100 protein families with known 3D structure, some with large insertions/deletions.

Program evaluation

1aboA 1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE

1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE

1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG

1ihvA 1 .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG

1vie 1 .drvrkksga.........awqGQIVGWYctnlt.............peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN......

1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....

1vie 28 YAVESeahpgsvQIYPVAALERIN......

Key

alpha helix: RED

beta strand: GREEN

core blocks: UNDERSCORE

Program evaluation

Results:

Four programs performed best, but no method was best in all test examples.

ClustalW, SAGA and PRRP best for global alignment, DIALIGN best for sequences with large insertions or deletions.

Program evaluation

(3) Lassmann and Sonnhammer (2002)

Used BAliBASE plus artificial sequences for local alignment

Results: T-COFFEE best for closely related sequences, DIALIGN best for distantly related sequences.

Alignment of large genomic sequences

Important tool for identifying functional sites (e.g. genes or regulatory elements)

Alignment of large genomic sequences

Phylogenetic Footprinting:

Functional sites more conserved during evolution

=> Sequence similarity indicates biological function

Alignment of large genomic sequences

DIALIGN performs well in identifying local homologies, but is slow

Quadratic program running time

Solution: Anchored alignments

Find anchor points to reduce search space

Solution: Anchored alignments

Use fast heuristic method to find anchor points:

CHAOS developed together with Mike Brudno

Brudno et al. (2003), BMC Bioinformatics 4:66
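The effect of anchoring on a quadratic method can be estimated with a small calculation. The sketch assumes that anchor points are given as (position in sequence 1, position in sequence 2) pairs that increase in both sequences, and that the alignment is then computed separately between consecutive anchors; the function name and the numbers are illustrative.

```python
def anchored_cells(len1, len2, anchors):
    """Number of position pairs that remain to be compared between consecutive anchors."""
    cells = 0
    prev_i = prev_j = 0
    for i, j in sorted(anchors) + [(len1, len2)]:
        cells += (i - prev_i) * (j - prev_j)
        prev_i, prev_j = i, j
    return cells

full = anchored_cells(1000, 1000, [])
reduced = anchored_cells(1000, 1000, [(250, 250), (500, 500), (750, 750)])
```

With three evenly spaced anchors the search space shrinks to a quarter, and in general k evenly spaced anchors reduce it by roughly a factor of k + 1.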

First step to gene prediction:

Exon discovery by genomic alignment

Evaluation of different alignment programs:

Compare local sequence similarity identified by alignment programs to known exons

Morgenstern et al. (2002), Bioinformatics 18:777-787

DIALIGN alignment of human and murine genomic sequences

DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences

Evaluation of DIALIGN, PipMaker, WABA, BLASTN and TBLASTX on a set of 42 human and murine genomic sequences.

Compare similarities to annotated exons

Apply cut-off parameter to resulting alignments

Measure sensitivity and specificity
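The last two steps can be sketched at the nucleotide level. The definitions used here are an assumption: sensitivity is the fraction of annotated exon positions covered by reported similarities, and specificity follows the gene-prediction convention of the fraction of reported positions that lie in annotated exons. Intervals are half-open (start, end) pairs and the example values are invented.

```python
def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return set().union(*(range(start, end) for start, end in intervals))

def sensitivity_specificity(predicted, annotated):
    """Return (sensitivity, specificity) under the definitions stated above."""
    pred, true = covered(predicted), covered(annotated)
    hits = len(pred & true)
    return hits / len(true), hits / len(pred)

sens, spec = sensitivity_specificity([(0, 10), (20, 30)], [(5, 15), (20, 25)])
```

Raising the cut-off parameter removes weak similarities, which typically trades sensitivity for specificity.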

Performance of long-range alignment programs for exon discovery (human - mouse comparison)

Performance of long-range alignment programs for exon discovery (A. thaliana - tomato comparison)

AGenDA:

Alignment-based Gene Detection Algorithm

Bridge small gaps between DIALIGN fragments -> cluster of fragments

Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons

Recursive algorithm finds biologically consistent chain of potential exons
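The first of these steps, bridging small gaps between fragments, amounts to merging nearby intervals. A minimal sketch, assuming fragments are given as (start, end) positions in one sequence and using a hypothetical `max_gap` threshold:

```python
def cluster_fragments(fragments, max_gap=10):
    """Merge fragments separated by at most max_gap positions into clusters."""
    clusters = []
    for start, end in sorted(fragments):
        if clusters and start - clusters[-1][1] <= max_gap:
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(c) for c in clusters]

clusters = cluster_fragments([(0, 20), (25, 40), (100, 130), (135, 150)])
```

The boundaries of the resulting clusters are where the search for splice sites and start/stop codons would then take place.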

Identification of candidate exons

Fragments in DIALIGN alignment

Identification of candidate exons

Build cluster of fragments

Identification of candidate exons

Identify conserved splice sites

Identification of candidate exons

Candidate exons bounded by conserved splice sites

Construct gene models using candidate exons

Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending

Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score

[Equation not recoverable from the transcript; it involves the terms sc(E), len(C), dis(C, E), w(f_i) and sc(SP)]

Find optimal consistent chain of candidate exons

atg gt ag gt ag tga atg tga

G1 G2

Find optimal consistent chain of candidate exons

Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time
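An N log N bound of this kind is typical for chaining scored intervals after sorting. The sketch below shows that generic technique (weighted interval scheduling with binary search), not the AGenDA algorithm itself: it ignores reading frames, splice-site pairing and start/stop codons, and the exon tuples are invented.

```python
from bisect import bisect_right

def best_exon_chain_score(exons):
    """Maximal total score of non-overlapping exons.

    exons is a list of (start, end, score) with half-open coordinates.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [end for _, end, _ in exons]
    best = [0.0] * (len(exons) + 1)   # best[k]: optimum over the first k exons
    for k, (start, _, score) in enumerate(exons, start=1):
        # number of earlier exons that end at or before this exon's start
        p = bisect_right(ends, start, 0, k - 1)
        best[k] = max(best[k - 1], best[p] + score)
    return best[-1]

score = best_exon_chain_score([(0, 10, 5.0), (5, 20, 6.0),
                               (20, 30, 4.0), (12, 18, 3.0)])
```

Sorting costs O(N log N) and each exon needs one binary search, so the whole computation stays within that bound.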

DIALIGN fragments

Candidate exons

Complete model

Results: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

[Bar chart: sensitivity and specificity (0% to 100%) of AGenDA and GenScan]

Results: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

[Figure: comparison of AGenDA and GenScan; values shown: 64 %, 12 %, 17 %]

Results:

Quality of AGenDA-based gene models comparable to results from GenScan

Exons identified that have not been identified by GenScan

No statistical models derived from known genes (no training data necessary!)

Method generally applicable

AGenDA:

Alignment-based Gene Detection Algorithm

WWW server:

http://bibiserv/TechFak.Uni-Bielefeld.DE/agenda

Rinner, Taher, Goel, Sczyrba, Brudno, Batzoglou, Morgenstern, submitted