Basic Local Alignment Search Tool (BLAST)

Basic Local Alignment Search Tool (BLAST) by Stephen F. Altschul et al., J. Mol. Biol., 1990

W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala

CONTENTS

» Introduction

» Methods

» Implementation

» Results

» Conclusion

» Questions & Answers

INTRODUCTION

WHY COMPARE ONE SEQUENCE TO ANOTHER?

• The function of a newly sequenced gene or protein can be predicted by discovering its homology to a known gene or protein.

• A major task of bioinformatics is to find homologous sequences in a database of sequences.

• Databases of DNA and amino acid sequences continue to grow in size.

• There are a number of software tools, based on many different algorithms.


EVOLUTION OF SIMILARITY SEARCH ALGORITHMS

Needleman-Wunsch Algorithm

• Dynamic programming algorithm

• Assigns scores to insertions, deletions and replacements.

• Computes the alignment of two sequences that requires the least costly set of mutations.

• The resulting alignment score serves as a measure of similarity.

• Because of its computational requirements, it is impractical for searching a large database without a supercomputer.
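The dynamic programming idea above can be sketched as follows. This is a minimal illustration, not the original authors' code: the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example. Filling the table takes time proportional to the product of the two sequence lengths, which is why the method is costly for database searches.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    # prev[j] is the best score for aligning the first i-1 residues of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("AC", "AGC"))  # two matches and one gap
```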

Smith-Waterman Algorithm

• Dynamic programming algorithm for optimal local alignment

FastP Algorithm

• Rapid heuristic algorithm

• Allows large databases to be searched on common computers

• David J. Lipman and William R. Pearson in 1985

• First finds locally similar regions between two sequences based on identities but not gaps.

• Then rescores those regions using a similarity matrix.

• Quite popular


NEED FOR A FAST ALIGNMENT METHOD

BLAST (Basic Local Alignment Search Tool) Algorithm

• A new approach to rapid sequence comparison.

• It directly approximates alignments that optimize a measure of local similarity called the Maximal Segment Pair (MSP) score.

• Recent mathematical results on MSP scores allow the performance of this method to be analyzed.

• The statistical significance of the generated alignments can be assessed.

• A simple and robust algorithm

Can be applied in various contexts

• DNA sequences

• Protein sequences

• Motif searches

• Gene identification searches

• Analysis of multiple regions of similarity in long DNA sequences

Characteristic features

• Flexibility and tractability to mathematical analysis.

• An order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

The paper describes the methods and implementation of the BLAST algorithm.

METHODS

THE MSP MEASURE

Discussion on Maximal Segment Pairs

TWO TYPES OF SIMILARITY MEASURES

Global

• Optimizes the overall alignment of two sequences.

• May include large stretches of low similarity.

Local

• Seeks only relatively conserved subsequences.

• A single comparison may yield several distinct subsequence alignments.

• Local similarity measures are preferred for database searches, where cDNAs may be compared with partially sequenced genes.

• Many similarity measures, including the one the authors describe, begin with a matrix of similarity scores for all possible pairs of residues.


Scoring Alignments

• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)

• Identities and conservative replacements: positive scores

• Unlikely replacements: negative scores

• Amino acid sequences: PAM or BLOSUM matrix

• DNA sequences: match = +5, mismatch = -4

• Sequence segment: a contiguous stretch of residues of any length

• The similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.
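As a small worked example of the scoring rule, the sketch below sums per-residue scores over two aligned DNA segments using the match = +5, mismatch = -4 values given above. The function names are hypothetical and chosen only for this illustration.

```python
def dna_score(x, y, match=5, mismatch=-4):
    """Score one pair of aligned nucleotides (values from the slide)."""
    return match if x == y else mismatch


def segment_pair_score(seg_a, seg_b, score=dna_score):
    """Sum the similarity values over each pair of aligned residues."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(score(x, y) for x, y in zip(seg_a, seg_b))


# Four matches and one mismatch: 4 * 5 - 4 = 16
print(segment_pair_score("ACGTA", "ACCTA"))
```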


Maximal Segment Pair (MSP)

• Given these rules, the authors define a Maximal Segment Pair (MSP) as the highest scoring pair of identical-length segments chosen from two sequences.

• The similarity score of an MSP is called the MSP score; BLAST heuristically attempts to calculate it.

• With long sequences, an exact search for the MSP score becomes computationally demanding.

• BLAST therefore searches for locally maximal segment pairs: segment pairs whose score cannot be improved by either extending or shortening the segments.
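To make the MSP definition concrete, here is a brute-force sketch that finds the exact MSP of two short sequences by scanning every diagonal of the comparison and keeping the best-scoring run. It is not how BLAST works; it is the exhaustive computation that BLAST approximates. The function names and the DNA scoring defaults are assumptions made for the example.

```python
def dna_score(x, y, match=5, mismatch=-4):
    return match if x == y else mismatch


def maximal_segment_pair(a, b, score=dna_score):
    """Return (score, start_in_a, start_in_b, length) of the exact MSP.

    Every ungapped segment pair lies on one diagonal (fixed i - j), so a
    maximum-sum run is computed along each diagonal in turn.
    """
    best = (float("-inf"), 0, 0, 0)
    for d in range(-(len(b) - 1), len(a)):
        i, j = (d, 0) if d >= 0 else (0, -d)
        run, run_i, run_j = 0, i, j
        while i < len(a) and j < len(b):
            if run <= 0:
                # A non-positive prefix never helps: restart the run here.
                run, run_i, run_j = 0, i, j
            run += score(a[i], b[j])
            if run > best[0]:
                best = (run, run_i, run_j, i - run_i + 1)
            i += 1
            j += 1
    return best


print(maximal_segment_pair("TTACGT", "ACGTCC"))
```

The work grows with the product of the two sequence lengths, which illustrates why an exact search against a whole database is demanding.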

RAPID APPROXIMATION OF MSP SCORES

• The goal is to report those database sequences that have an MSP score above some cutoff score S.

• The highest MSP score S at which chance similarities are likely to appear can be estimated statistically.

• BLAST minimizes the time spent on database sequences whose similarity with the query has little chance of exceeding this score.

• Let a word pair be a segment pair of fixed length w.

• Main strategy: seek only segment pairs (one segment from the database, one from the query) that contain a word pair with a score of at least T.

• Each such hit is extended to determine whether it lies within a segment pair whose score is at least the cutoff S; those segment pairs are the final output of BLAST.

• Lower T => fewer false negatives (fewer segment pairs with score of at least S are missed)

• Lower T => more word pairs to analyze, and so a longer running time

IMPLEMENTATION

Key Steps of BLAST Algorithm

• BLAST finds locally maximal segment pairs whose scores exceed a particular cutoff.

• The algorithm has three steps:

  1. Compile a list of high-scoring words.

  2. Scan the database for hits.

  3. Extend the hits that meet the scoring criterion: only word pairs with a score of at least T are extended, to determine whether each lies within a segment pair with a score of at least S.

COMPILING OF HIGH SCORING WORDS

• Obtain the list of words (k-mers) that give a score of T or higher when aligned with a word of the query sequence.

• Assume that the query sequence is P Q G E F G.

• We have the following four 3-mers (recall that a sequence of length N contains N-k+1 k-mers):

P Q G, Q G E, G E F, E F G
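The k-mer enumeration for the example query can be reproduced with a short sketch; the function name `kmers` is an illustrative choice.

```python
def kmers(seq, k):
    """Return the len(seq) - k + 1 overlapping words of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("PQGEFG", 3))  # ['PQG', 'QGE', 'GEF', 'EFG']
```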

Each of these 3-mers is then scored against every k-mer in each target sequence. For long sequences, this could well include all 20^3 = 8000 possible 3-mers.

• Of all the pairwise scorings for P Q G (using the BLOSUM-62 matrix), the following are among the high-scoring ones:

  • P Q G (a perfect match): score of 7+5+6 = 18

  • P E G: score of 7+2+6 = 15

  • P Q A: score of 7+5+0 = 12
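The word-scoring step can be sketched as below. To keep the example self-contained, it uses only the BLOSUM62 entries for five residues (A, E, G, P, Q) rather than the full 20 x 20 matrix, so the alphabet restriction is an assumption of this sketch; a real search would enumerate words over all 20 amino acids. The names `pair_score` and `neighborhood` are hypothetical.

```python
from itertools import product

# BLOSUM62 entries for five residues only (each symmetric pair listed once).
_PAIRS = {
    ("A", "A"): 4, ("E", "E"): 5, ("G", "G"): 6, ("P", "P"): 7, ("Q", "Q"): 5,
    ("A", "E"): -1, ("A", "G"): 0, ("A", "P"): -1, ("A", "Q"): -1,
    ("E", "G"): -2, ("E", "P"): -1, ("E", "Q"): 2,
    ("G", "P"): -2, ("G", "Q"): -2, ("P", "Q"): -1,
}
ALPHABET = "AEGPQ"  # restricted alphabet, an assumption of this sketch


def pair_score(x, y):
    """Look up a symmetric substitution score."""
    return _PAIRS[(x, y)] if (x, y) in _PAIRS else _PAIRS[(y, x)]


def neighborhood(word, threshold, alphabet=ALPHABET):
    """Return {candidate: score} for all words scoring >= threshold vs word."""
    hits = {}
    for cand in product(alphabet, repeat=len(word)):
        s = sum(pair_score(x, y) for x, y in zip(word, cand))
        if s >= threshold:
            hits["".join(cand)] = s
    return hits


print(neighborhood("PQG", 13))  # {'PEG': 15, 'PQG': 18}
```

Lowering the threshold to 12 admits P Q A as well, which shows how T controls the size of the word list.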


SCANNING THE DB FOR HITS

• Scan the database for hits against the list of words compiled in the previous step.

• The problem: how to search a long sequence efficiently for multiple occurrences of short words.

• BLAST has two approaches:

  • Indexing approach

  • Finite state machine

INDEXING APPROACH – EXAMPLE 1

• Build a lookup table of size |Σ|^w, with one entry for every possible word of length w over the alphabet Σ.

INDEXING APPROACH – EXAMPLE 2

• Let w = 3. For amino acids, the number of possible words is 20^3 = 8000.

• Map each word to an integer between 1 and 20^3.

• Each word is thus an index into an array.

• Each array entry points to the list of positions where the word matches in the query sequence.

• As we scan the database, each database word leads immediately to the corresponding hits in the query sequence.
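The indexing approach can be sketched as follows. This is a simplified illustration under stated assumptions: it indexes only the exact words of the query (a fuller version would also store the neighborhood words that score at least T), it uses 0-based indices from 0 to 20^3 - 1 rather than 1 to 20^3, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Map a word to an integer by reading it as a base-20 number."""
    idx = 0
    for aa in word:
        idx = idx * len(AMINO_ACIDS) + CODE[aa]
    return idx


def build_query_table(query, w=3):
    """Array of 20**w lists; entry i holds query positions of word i."""
    table = [[] for _ in range(len(AMINO_ACIDS) ** w)]
    for pos in range(len(query) - w + 1):
        table[word_index(query[pos:pos + w])].append(pos)
    return table


def scan(database_seq, table, w=3):
    """Return (query_pos, db_pos) for every database word found in the table."""
    hits = []
    for db_pos in range(len(database_seq) - w + 1):
        for q_pos in table[word_index(database_seq[db_pos:db_pos + w])]:
            hits.append((q_pos, db_pos))
    return hits


table = build_query_table("PQGEFG")
print(scan("AAPQGAAEFG", table))  # [(0, 2), (3, 7)]
```

Each database word costs one array lookup, so the scan time grows linearly with the database length plus the number of hits.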


EXTENDING HITS

• Once hits have been located in both the query and the target sequence, extend them to form high-scoring segment pairs.

• Find the highest-scoring segment pair (the maximal segment pair), or all segment pairs whose score exceeds another user-set threshold, S.

• When a hit is found (a match between a word and a database entry), extend it in both directions.

• Keep track of the score, using the scoring matrix.

• Stop extending in a direction when the score falls a set distance below the best score found so far.
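The extension step can be sketched as an ungapped extension with a drop-off rule. This is a minimal illustration, not the original implementation: the parameter name `x_drop`, the function names, and the DNA scoring values are assumptions of the example.

```python
def dna_score(x, y, match=5, mismatch=-4):
    return match if x == y else mismatch


def extend_hit(query, target, q_pos, t_pos, w, x_drop, score=dna_score):
    """Extend a word hit of length w without gaps in both directions.

    Returns (score, query_start, target_start, length) of the best segment
    pair found. Extension in a direction stops once the running score falls
    more than x_drop below the best score seen in that direction.
    """
    seed = sum(score(query[q_pos + k], target[t_pos + k]) for k in range(w))

    def extend(qi, ti, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            run += score(query[qi], target[ti])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break
            qi += step
            ti += step
        return best, best_len

    right, right_len = extend(q_pos + w, t_pos + w, +1)
    left, left_len = extend(q_pos - 1, t_pos - 1, -1)
    return (seed + left + right,
            q_pos - left_len, t_pos - left_len,
            left_len + w + right_len)


# The seed "GT" at position 4 grows to the six-residue match "ACGTAC".
print(extend_hit("GGACGTACGG", "CCACGTACCC", 4, 4, w=2, x_drop=10))
```

A larger drop-off value explores further past mismatches at the cost of more time per hit.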

EXTENDING HITS – EXAMPLE 1

• Extend each seed on either side until the aggregate alignment score falls below a threshold.

• Ungapped: extend using only matches and mismatches.

• Gapped: extend using matches, mismatches, or a limited number of insertion/deletion gaps.

EXTENDING HITS – EXAMPLE 2 (figure; not reproduced in this transcript)

RESULTS

Evaluating Statistical Significance

• Finally, evaluate the statistical significance of the alignments whose scores exceed the threshold.

• The paper assesses BLAST's performance, and the statistical significance of MSP scores, in three settings:

  • Performance of BLAST with random sequences

  • Performance of BLAST with homologous sequences

  • Performance when comparing long DNA sequences

Performance of BLAST with random sequences

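• When two random sequences of lengths m and n are compared, the probability of finding at least one HSP with a score ≥ S "by chance" is:

$$P(X \ge 1) = 1 - P(X < 1) = 1 - P(X = 0) = 1 - e^{-E}$$

• More generally, the probability of finding exactly x HSPs with a score ≥ S is given by the Poisson distribution:

$$P(X = x) = e^{-E} \frac{E^{x}}{x!}$$

• Here E is the expected number of HSPs with a score ≥ S, also called the E-value. It is defined by the Karlin-Altschul equation:

$$E = K m n \, e^{-\lambda S}$$

where K and λ are parameters that depend on the scoring matrix and the residue frequencies.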

RESULTS

• The probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula:

$$P(X \ge c) = 1 - P(X < c) = 1 - e^{-E} \sum_{i=0}^{c-1} \frac{E^{i}}{i!}$$

• Using this formula, two sequences that share several distinct regions of similarity can be detected as significantly related.
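These significance formulas are easy to evaluate numerically. The sketch below computes the E-value from the Karlin-Altschul equation E = K m n e^(-λS) and the Poisson tail probability of seeing c or more segment pairs. The function names are hypothetical, and the values of K and λ in the usage lines are placeholders chosen only for illustration; they are not taken from the paper.

```python
import math


def e_value(m, n, s, k, lam):
    """Expected number of segment pairs with score >= s (Karlin-Altschul)."""
    return k * m * n * math.exp(-lam * s)


def p_at_least(c, e):
    """P(X >= c) when X is Poisson-distributed with mean e."""
    below = sum(e ** i / math.factorial(i) for i in range(c))
    return 1.0 - math.exp(-e) * below


# Placeholder parameters (assumptions, not values from the paper).
e = e_value(m=250, n=50_000_000, s=80, k=0.1, lam=0.3)
print(e, p_at_least(1, e), p_at_least(2, e))
```

For small E, `p_at_least(1, e)` is close to E itself, which is why the E-value is often read directly as an approximate probability.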

The choice of word length & threshold parameters

Necessary & sufficient adjustments to be made in terms of word length & threshold value

Time required to execute BLAST

• Step 1: compile the list of words.

• Step 2: scan the database for hits (word pairs with a score of at least T).

• Step 3: extend all hits to seek segment pairs with scores exceeding the cutoff S.

• All three steps depend on W and T.

Can we make the process more efficient?

• Decrease the time spent on step 3 by increasing W.

But a larger W creates problems of its own.

• For proteins there are 20^W possible words, so as W increases the number of words generated by the query grows exponentially. (For a fixed W, the number of words increases only linearly with the length of the query.)

• This increases the time spent on step 1 and also the amount of memory required.

Optimal T and W values

• For protein sequences W=4, T=17.

• For DNA sequences W=11.

How W and T affect the performance of BLAST

[Table: rows T and W; columns Execution Time, Accuracy, Speed; cell values not recoverable]

Computational complexity?

Performance of BLAST with homologous sequences

Things to note when a query sequence is searched with BLAST against a set of homologous sequences

What is homologous?

• Related by a common ancestor.

Researchers' Example 1

• Search for a woolly monkey sequence.

• W=4 and T=17.

• Found 178 MSPs with scores between 50 and 80.

• The random model suggests that BLAST should miss 24 of these MSPs.

• It actually misses 43.

• Therefore error = (43 - 24) / 43 = 44.2%

Researchers' Example 2

• Search for mouse sequences.

• Same W and T as in the previous example.

• Found 33 MSPs with scores between 45 and 65.

• The random model suggests that BLAST should miss 8 of these MSPs.

• It actually misses 2.

• Therefore error = (2 - 8) / 2 = -300%

• Failure to detect significant similarity shows only our inability to detect homology; it does not prove that the sequences are not homologous.

• The overall performance of BLAST depends on the distribution of MSP scores.

Strengths of BLAST

• Its great utility is in identifying high-scoring MSPs quickly.

• It takes less time for the alignment process than existing tools of comparable sensitivity.

Further improvements are possible

• Novel approaches such as Position-Specific Iterated BLAST (PSI-BLAST)

Comparison of two long DNA sequences

Does adjusting W make the process faster?

Main classes of locally similar regions

• Genes.

• Long interspersed repeats.

• Anticipated weaker similarities.

Example (human gene vs. rabbit gene)

Step 1

• Match score = 5, mismatch score = -4 and W=12.

• 93 alignments score over 200, and 57 alignments score over 350; the highest score is 1301.

Step 2

• W=8.

• Only 32 additional alignments scoring over 200 are found.

• Using a smaller W does not always provide essential new information.

The Time & Sensitivity of BLAST on DNA Sequences as a Function of W

CONCLUSION

Underlying concept of BLAST

• Simple and robust.

• Can be implemented in many ways.

• Can be used in a variety of contexts.

Researchers' implementation

• Used a shared memory version of BLAST.

• Why shared memory?

• It loads the compressed DNA file into memory once and allows subsequent searches to skip that step.

• The BLAST approach permits the construction of extremely fast programs for database searching, which have the further advantage of being amenable to mathematical analysis.

Who this tool would help

• Molecular biologists

• Doctors, etc.

QUESTIONS & ANSWERS

• What are the disadvantages of BLAST compared with other heuristic algorithms? – by Pubudu

• What are the criteria for selecting the threshold values S and T used in the implementation part of this research paper? – by Parinda

• What are the chances that the Maximal Segment Pair score for two unrelated sequences would be greater than or equal to S? – by Parinda

• Between the PAM and BLOSUM matrices, which is the most suitable for scoring segments in amino acid sequences, and why? – by Shashika

THANKS FOR YOUR TIME
