Basic Local Alignment Search Tool (BLAST)

51
Basic Local Alignment Search Tool (BLAST) by Stephen F. Altschul et al J. Mol. Bio 1990 W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala

Transcript of Basic Local Alignment Search Tool (BLAST)

Page 1: Basic Local Alignment Search Tool (BLAST)

Basic Local Alignment Search Tool (BLAST) by Stephen F. Altschul et al J. Mol.

Bio 1990

W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala

Page 2: Basic Local Alignment Search Tool (BLAST)

CONTENTS

ConclusionResults

MethodsIntroduction

»

»

»

»

Implementation

»

Questions & Answers

»

Page 3: Basic Local Alignment Search Tool (BLAST)

INTRODUCTION

Page 4: Basic Local Alignment Search Tool (BLAST)

WHY COMPARE ONE SEQUENCE TO ANOTHER?

Introduction

Page 5: Basic Local Alignment Search Tool (BLAST)

• Function of a newly sequenced gene or a protein can be

predicted by discovering it’s homology to a known gene or a

protein.

• A major task of Bioinformatics is to find homologous

sequence in a database of sequences

• Databases of DNA and amino acid sequences continue to

grow in size.

• There are number of software tools based on many

algorithms.

Introduction

Page 6: Basic Local Alignment Search Tool (BLAST)

EVOLUTION OF SIMILARITY SEARCH ALGORITHMS

Introduction

Page 7: Basic Local Alignment Search Tool (BLAST)

Needleman-Wunsch Algorithm

• Dynamic programming algorithm

• Assign scores to insertions ,deletions and replacements.

• Compute an alignment of two sequences with least mutations.

• Measurement of Similarity

• Because of computational requirements impractical for searching

a large database without a supercomputer

Introduction

Page 8: Basic Local Alignment Search Tool (BLAST)

Smith Waterman• Rapid Heuristic Algorithm• Allows large databases to be searched on common computers

FastP Algorithm

• David J. Lipman and William R. Pearson in 1985

• First find locally similar regions between 2 sequences based on identities but

not gaps.

• Then rescores those regions using a similarity matrix.

• Quite popular

Introduction

Page 9: Basic Local Alignment Search Tool (BLAST)

NEED FOR A FAST ALIGNMENT METHOD

Introduction

Page 10: Basic Local Alignment Search Tool (BLAST)

BLAST(Basic Local Alignment Search Tool) Algorithm

• A new approach to rapid sequence comparison.

• It directly approximate alignments that optimize a measure of

local similarity called the Maximal Segment Pair(MSP) score.

• Recent mathematical results on MSP scores allows

performance analysis of this method

• Generated alignment has a statistical significance.

• Simple & a robust algorithm

Introduction

Page 11: Basic Local Alignment Search Tool (BLAST)

Can be applied to various contexts

• DNA sequences

• Protein sequences

• Motif searches

• Gene identification searches

• Analysis of multiple similarity regions in long DNA sequences Characteristic features

• Flexibility & Tractability.

• Very much faster than existing sequence comparison tools of

comparable sensitivity.

Introduction

This research paper describes all the methods &

implementations of BLAST algorithm.

Page 12: Basic Local Alignment Search Tool (BLAST)

METHODS

Page 13: Basic Local Alignment Search Tool (BLAST)

THE MSP MEASURE

Discussion on Maximal Segment Pairs

METHODS

Page 14: Basic Local Alignment Search Tool (BLAST)

TWO TYPES OF SIMILARITY

MEASURES

Methods

Global• Optimize the overall alignment of two sequences.• Includes large stretches of low similarity

Local

• Seek only relatively conserved subsequences• Single comparison may yield several distinct

subsequence alignments

Page 15: Basic Local Alignment Search Tool (BLAST)

• Local similarity measure are preferred for

database searches where cDNAs can be

compared with partially sequence genes.

• Many similarity measures including the one they

describe begins with a matrix of similarity score

for all possible pairs of residues

BLAST

Methods

Page 16: Basic Local Alignment Search Tool (BLAST)

Scoring Alignments

• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)• Identities & conservative replacements >>> Positive Scores• Unlikely Replacements >> Negative scores

• Amino acid sequences : “PAM” matrixBLOSUM

• DNA sequences : match = +5mismatch = -4

• Sequence Segment – Contiguous stretch of residues of any length

• Similarity score for segments is the sum of similarity values for each pair of aligned residues

Methods

Page 17: Basic Local Alignment Search Tool (BLAST)

Maximal Segment Pair (MSP)

• Given these rules they have defined a Maximal Segment Pair (MSP) which is

the highest scoring pair of identical length segments chosen from 2 sequences.

• The similarity score of an MSP is called the MSP score which is calculated by

BLAST.

• With long sequences the search for the MSP score becomes computationally

demanding.

• Therefore BLAST searches for locally maximal segment pairs – Score cannot be

improved by either extending or shortening segments.

Methods

Page 18: Basic Local Alignment Search Tool (BLAST)

RAPID APPROXIMATION OF MSP SCORES

METHODS

Page 19: Basic Local Alignment Search Tool (BLAST)

• Goal is to report those database sequences that have MSP score above

some cutoff score S.

• Statistically the highest MSP score S can be estimated at which “chance

similarities” are likely to appear.

• BLAST minimizes time spent on database sequences whose similarity

with the query has little chance of exceeding this score.

Methods

Page 20: Basic Local Alignment Search Tool (BLAST)

• Let a word pair be a segment pair with a fixed length w.

• Main strategy: seek only segment pairs (one from database, one query)

that contain a word pair with score at least T.

• Such hit will be extended until it exceeds the cutoff score S and those hits

will be the final output of BLAST.

• Lower T => Fewer false negatives

• Lower T => More pairs to analyze

Methods

Page 21: Basic Local Alignment Search Tool (BLAST)

IMPLEMENTATION

Key Steps of BLAST Algorithm

METHODS

Page 22: Basic Local Alignment Search Tool (BLAST)

• BLAST finds locally maximal segment pairs that exceeds a particular cutoff.

• Detailed annotation of Three Algorithmic steps.

• Compile a list of high-scoring words. • Scanning the DB for hits.• Extending hits that meet certain scoring

criteria (Extend only word pairs with a score of at least T to determine if it has a segment pair of score at least S).

METHODS

BLAST

Page 23: Basic Local Alignment Search Tool (BLAST)

COMPILING OF HIGH SCORING WORDS

• Obtain the list of words in the target sequence (k-mers), that give a score of T or higher when aligned with the query sequence.

• Assume that the query sequence is P Q G E F G •We have the following four 3-mers (recall, the number of k-mers is

always N-k+1):

P Q G Q G E G E F E F G

METHODS

Each of these 3-mers are then scored against each and every one of the k-mers in each of the target sequence. For long sequences, this could well include all 8000 possible k-mers.

Page 24: Basic Local Alignment Search Tool (BLAST)

• So, of all pairwise scorings for P Q G (using the BLOSUM-62 matrix), we can find the following high scoring ones:

• P Q G (of course, this is a perfect match) score of 7+5+6 = 18 • P E G score of 7+2+6 = 15 • P Q A score of 7+5+0 = 12

METHODS

Page 25: Basic Local Alignment Search Tool (BLAST)

SCANNING THE DB FOR HITS

• Scan the database for hits with the compiled list of words obtained in previous step.

• How efficiently search a long sequence for multiple occurrences of short sequences.

• BLAST has two approaches• Indexing approach• Finite state machine

METHODS

Page 26: Basic Local Alignment Search Tool (BLAST)

INDEXING APPROACH – EXAMPLE 1

• Build a lookup table of size |Σ|w for all w-length words in DB.

METHODS

Page 27: Basic Local Alignment Search Tool (BLAST)

INDEXING APPROACH – EXAMPLE 2

• Let w=3. For amino acids, the number of words is 203.

• Map a word to an integer between 1 and 203.

• Thus a word has an index into an array.

• Each index points to a list of matches of the word in the query sequence.

• As we scan the database, each database word immediately leads to the hits in the query sequence.

METHODS

Page 28: Basic Local Alignment Search Tool (BLAST)

EXTENDING HITS

• Once the hits are located both in the query and the target sequence, extend the hits to form high scoring segment pairs.

• Find the highest scoring segment (the maximal segment pair) or those whose score exceeds (another user set) threshold S.

• When manage to find a hit (a match between a “word” and a database entry), extend the hit in either direction.

• Keep track of the score (use a scoring matrix).

• Stop when the score drops below some cutoff.

METHODS

Page 29: Basic Local Alignment Search Tool (BLAST)

EXTENDING HITS – EXAMPLE 1

• Extend each seed on either side until the aggregate alignment score falls below a threshold.

• Un-gapped: Extend by only either matches or mismatches.• Gapped: Extend by matches, mismatches or a limited number of insertion/deletion

gaps.

METHODS

Page 30: Basic Local Alignment Search Tool (BLAST)

EXTENDING HITS – EXAMPLE 2

METHODS

Page 31: Basic Local Alignment Search Tool (BLAST)

RESULTS

Page 32: Basic Local Alignment Search Tool (BLAST)

Evaluating Statistical Significance

RESULTS

Page 33: Basic Local Alignment Search Tool (BLAST)

• Finally, evaluate the statistical significance of the alignments / scores that exceed the threshold.

• BLAST statistical significance of MSP scores can be evaluated by following factors.

• Performance of BLAST with random sequences• Performance of BLAST with homologous

sequences• Performance comparing long DNA

sequences

RESULTS

BLAST

Page 34: Basic Local Alignment Search Tool (BLAST)

Performance of BLAST with random sequences

RESULTS

Page 35: Basic Local Alignment Search Tool (BLAST)

• When two random sequences of length m and n compared, the probability of finding at least one HSP “by chance” is:

• Hence, the probability of finding exactly x HSPs with a score ≥ S is given by:

RESULTS

EeXPXPXP 1)0(1)1(1)1(

!)(

x

EexXP

xE

BLAST

Page 36: Basic Local Alignment Search Tool (BLAST)

• Where E can be defined by according to the Karlin-Altschul equation. • E(HSPs with score) ≥ S, also called the E-value:

RESULTS

Page 37: Basic Local Alignment Search Tool (BLAST)

• The probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula:

• Utilizing this formula can be detected two sequences that share distinct regions of similarity as significantly related.

RESULTS

1

0 !1)(1)(

c

i

iE

i

EecXPcXP

BLAST

Page 38: Basic Local Alignment Search Tool (BLAST)

The choice of word length & threshold parameters Necessary & Sufficient Adjustments to be Done in Terms of Word Length & Threshold Value

RESULTS

Page 39: Basic Local Alignment Search Tool (BLAST)

Time required to execute BLAST• To Compile List of Words.• To Scan the Database for Hits (MSP>T).• To Extend All Hits to Seek Segment Pairs with Scores

Exceeding the Cutoff (HSP>S).• All these three steps depend on W & T

Can we make the process more optimal?• Decrease the time spent on step 3, by increasing the W.

But there are complementary problems created by larger W.

• For Proteins – 20W possible word, Therefor when W increases number of words generated by query grows exponentially. (But number of words increases linearly with the length of the query)

• It increases time spent on step 1 & also the amount of memory required.

Results

BLAST

Page 40: Basic Local Alignment Search Tool (BLAST)

Optimal T and W values• For protein sequences W=4, T=17.• For DNA sequences W=11.

How W and T affect the performance of BLAST

Results

BLAST

T Execution Time Accuracy Speed

W Execution Time Accuracy Speed

Computational Complexity ?

Page 41: Basic Local Alignment Search Tool (BLAST)

Performance of BLAST with homologous sequences

Things to be Noted When Query Sequence BLAST Against Set of Homologous Sequences

RESULTS

Page 42: Basic Local Alignment Search Tool (BLAST)

What is homologous?• Related by common ancestor.

Researchers Example 1• Search for wooly monkey sequence.• When W=4 & T=17.• Found 178 MSPs with scores (50-80).• Random model suggest that BLAST should miss 24 of MSPs• But actual miss 43.• Therefor error = 44.2%

Researchers Example 2• Search for mouse sequences.• Same W & T as previous.• Found 33 MSPs with scores (45-65).• Random model suggest that BLAST should miss 8 of MSPs• But actual miss 2.• Therefor error = -300%

RESULTS

BLAST

Page 43: Basic Local Alignment Search Tool (BLAST)

• Failure to detect significant similarity does only shows our inability to detect homology, it does not prove that the sequences are not homologous.

• The overall performance of BLAST depends on the distribution of MSP scores.

Strengths of BLAST• Great utility is for identify high scoring MSPs quickly.• Takes lower amount of time for the alignment process.

Further improvements can be done• Novel approaches like Position Specific Iterated BLAST

(PSI BLAST)

RESULTS

BLAST

Page 44: Basic Local Alignment Search Tool (BLAST)

Comparison of two long DNA sequences

Does Adjustments of W Make the Process Faster?

RESULTS

Page 45: Basic Local Alignment Search Tool (BLAST)

Main Classes of Locally Similar Regions• Genes.• Long interspersed Repeats.• Anticipated Weaker Similarities.

Example (Human Gene VS Rabbit Gene)Step -1 • Match Score = 5, Mismatch Score = -4 & W=12.• 93 Alignments scoring over 200, 57 Alignments scoring

over 350 with 1301 highest score.Step -2• W=8• Only additional 32 alignments are found score over 200.

• Use of Smaller W does not provides new essential information always.

Results

BLAST

Page 46: Basic Local Alignment Search Tool (BLAST)

Results

BLAST

The Time & Sensitivity of

BLAST on DNA Sequences as

a Function of W

Page 47: Basic Local Alignment Search Tool (BLAST)

CONCLUSION

Page 48: Basic Local Alignment Search Tool (BLAST)

Underlying Concept of BLAST• Simple & Robust.• Can be implemented in many ways.• Can be used in variety of contexts.

Researchers Implementation• Used a shared memory version of BLAST.• Why shared memory?• Loads compressed DNA file into memory once & allow

subsequent steps to skip that step.

• BLAST approach permits construction of extremely fast programs for database searching, Which provides additional advantage on mathematical advantage

To Whom This Tool Would Help• Molecular Biologists• Doctors & etc…

Conclusion

BLAST

Page 49: Basic Local Alignment Search Tool (BLAST)

QUESTIONS & ANSWERS

Page 50: Basic Local Alignment Search Tool (BLAST)

Q&A

Q&A

• What are the disadvantages of Blast with compared to other heuristic algorithms? – by Pubudu• What is the criteria of selecting threshold S and T values using in

implementation part of this research paper? – by Parinda• What are the chances that the Maximal Segment Pair score for

two unrelated sequences would be greater than or equal to S value? – by Parinda• Among PAM and BLOSUM matrices what is the most suitable matrix

when scoring segments in amino acid sequences ? Why? – by Shashika

Page 51: Basic Local Alignment Search Tool (BLAST)

THANKS FOR YOUR TIME