Basic Local Alignment Search Tool (BLAST)
-
Upload
asiri-wijesinghe -
Category
Software
-
view
550 -
download
7
Transcript of Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool (BLAST) by Stephen F. Altschul et al J. Mol.
Bio 1990
W.O.K.A.S Wijesinghe, S. D. L Gunawardena, A. A. M Athukorala
CONTENTS
ConclusionResults
MethodsIntroduction
»
»
»
»
Implementation
»
Questions & Answers
»
INTRODUCTION
WHY COMPARE ONE SEQUENCE TO ANOTHER?
Introduction
• Function of a newly sequenced gene or a protein can be
predicted by discovering it’s homology to a known gene or a
protein.
• A major task of Bioinformatics is to find homologous
sequence in a database of sequences
• Databases of DNA and amino acid sequences continue to
grow in size.
• There are number of software tools based on many
algorithms.
Introduction
EVOLUTION OF SIMILARITY SEARCH ALGORITHMS
Introduction
Needleman-Wunsch Algorithm
• Dynamic programming algorithm
• Assign scores to insertions ,deletions and replacements.
• Compute an alignment of two sequences with least mutations.
• Measurement of Similarity
• Because of computational requirements impractical for searching
a large database without a supercomputer
Introduction
Smith Waterman• Rapid Heuristic Algorithm• Allows large databases to be searched on common computers
FastP Algorithm
• David J. Lipman and William R. Pearson in 1985
• First find locally similar regions between 2 sequences based on identities but
not gaps.
• Then rescores those regions using a similarity matrix.
• Quite popular
Introduction
NEED FOR A FAST ALIGNMENT METHOD
Introduction
BLAST(Basic Local Alignment Search Tool) Algorithm
• A new approach to rapid sequence comparison.
• It directly approximate alignments that optimize a measure of
local similarity called the Maximal Segment Pair(MSP) score.
• Recent mathematical results on MSP scores allows
performance analysis of this method
• Generated alignment has a statistical significance.
• Simple & a robust algorithm
Introduction
Can be applied to various contexts
• DNA sequences
• Protein sequences
• Motif searches
• Gene identification searches
• Analysis of multiple similarity regions in long DNA sequences Characteristic features
• Flexibility & Tractability.
• Very much faster than existing sequence comparison tools of
comparable sensitivity.
Introduction
This research paper describes all the methods &
implementations of BLAST algorithm.
METHODS
THE MSP MEASURE
Discussion on Maximal Segment Pairs
METHODS
TWO TYPES OF SIMILARITY
MEASURES
Methods
Global• Optimize the overall alignment of two sequences.• Includes large stretches of low similarity
Local
• Seek only relatively conserved subsequences• Single comparison may yield several distinct
subsequence alignments
• Local similarity measure are preferred for
database searches where cDNAs can be
compared with partially sequence genes.
• Many similarity measures including the one they
describe begins with a matrix of similarity score
for all possible pairs of residues
BLAST
Methods
Scoring Alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)• Identities & conservative replacements >>> Positive Scores• Unlikely Replacements >> Negative scores
• Amino acid sequences : “PAM” matrixBLOSUM
• DNA sequences : match = +5mismatch = -4
• Sequence Segment – Contiguous stretch of residues of any length
• Similarity score for segments is the sum of similarity values for each pair of aligned residues
Methods
Maximal Segment Pair (MSP)
• Given these rules they have defined a Maximal Segment Pair (MSP) which is
the highest scoring pair of identical length segments chosen from 2 sequences.
• The similarity score of an MSP is called the MSP score which is calculated by
BLAST.
• With long sequences the search for the MSP score becomes computationally
demanding.
• Therefore BLAST searches for locally maximal segment pairs – Score cannot be
improved by either extending or shortening segments.
Methods
RAPID APPROXIMATION OF MSP SCORES
METHODS
• Goal is to report those database sequences that have MSP score above
some cutoff score S.
• Statistically the highest MSP score S can be estimated at which “chance
similarities” are likely to appear.
• BLAST minimizes time spent on database sequences whose similarity
with the query has little chance of exceeding this score.
Methods
• Let a word pair be a segment pair with a fixed length w.
• Main strategy: seek only segment pairs (one from database, one query)
that contain a word pair with score at least T.
• Such hit will be extended until it exceeds the cutoff score S and those hits
will be the final output of BLAST.
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze
Methods
IMPLEMENTATION
Key Steps of BLAST Algorithm
METHODS
• BLAST finds locally maximal segment pairs that exceeds a particular cutoff.
• Detailed annotation of Three Algorithmic steps.
• Compile a list of high-scoring words. • Scanning the DB for hits.• Extending hits that meet certain scoring
criteria (Extend only word pairs with a score of at least T to determine if it has a segment pair of score at least S).
METHODS
BLAST
COMPILING OF HIGH SCORING WORDS
• Obtain the list of words in the target sequence (k-mers), that give a score of T or higher when aligned with the query sequence.
• Assume that the query sequence is P Q G E F G •We have the following four 3-mers (recall, the number of k-mers is
always N-k+1):
P Q G Q G E G E F E F G
METHODS
Each of these 3-mers are then scored against each and every one of the k-mers in each of the target sequence. For long sequences, this could well include all 8000 possible k-mers.
• So, of all pairwise scorings for P Q G (using the BLOSUM-62 matrix), we can find the following high scoring ones:
• P Q G (of course, this is a perfect match) score of 7+5+6 = 18 • P E G score of 7+2+6 = 15 • P Q A score of 7+5+0 = 12
METHODS
SCANNING THE DB FOR HITS
• Scan the database for hits with the compiled list of words obtained in previous step.
• How efficiently search a long sequence for multiple occurrences of short sequences.
• BLAST has two approaches• Indexing approach• Finite state machine
METHODS
INDEXING APPROACH – EXAMPLE 1
• Build a lookup table of size |Σ|w for all w-length words in DB.
METHODS
INDEXING APPROACH – EXAMPLE 2
• Let w=3. For amino acids, the number of words is 203.
• Map a word to an integer between 1 and 203.
• Thus a word has an index into an array.
• Each index points to a list of matches of the word in the query sequence.
• As we scan the database, each database word immediately leads to the hits in the query sequence.
METHODS
EXTENDING HITS
• Once the hits are located both in the query and the target sequence, extend the hits to form high scoring segment pairs.
• Find the highest scoring segment (the maximal segment pair) or those whose score exceeds (another user set) threshold S.
• When manage to find a hit (a match between a “word” and a database entry), extend the hit in either direction.
• Keep track of the score (use a scoring matrix).
• Stop when the score drops below some cutoff.
METHODS
EXTENDING HITS – EXAMPLE 1
• Extend each seed on either side until the aggregate alignment score falls below a threshold.
• Un-gapped: Extend by only either matches or mismatches.• Gapped: Extend by matches, mismatches or a limited number of insertion/deletion
gaps.
METHODS
EXTENDING HITS – EXAMPLE 2
METHODS
RESULTS
Evaluating Statistical Significance
RESULTS
• Finally, evaluate the statistical significance of the alignments / scores that exceed the threshold.
• BLAST statistical significance of MSP scores can be evaluated by following factors.
• Performance of BLAST with random sequences• Performance of BLAST with homologous
sequences• Performance comparing long DNA
sequences
RESULTS
BLAST
Performance of BLAST with random sequences
RESULTS
• When two random sequences of length m and n compared, the probability of finding at least one HSP “by chance” is:
• Hence, the probability of finding exactly x HSPs with a score ≥ S is given by:
RESULTS
EeXPXPXP 1)0(1)1(1)1(
!)(
x
EexXP
xE
BLAST
• Where E can be defined by according to the Karlin-Altschul equation. • E(HSPs with score) ≥ S, also called the E-value:
RESULTS
• The probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula:
• Utilizing this formula can be detected two sequences that share distinct regions of similarity as significantly related.
RESULTS
1
0 !1)(1)(
c
i
iE
i
EecXPcXP
BLAST
The choice of word length & threshold parameters Necessary & Sufficient Adjustments to be Done in Terms of Word Length & Threshold Value
RESULTS
Time required to execute BLAST• To Compile List of Words.• To Scan the Database for Hits (MSP>T).• To Extend All Hits to Seek Segment Pairs with Scores
Exceeding the Cutoff (HSP>S).• All these three steps depend on W & T
Can we make the process more optimal?• Decrease the time spent on step 3, by increasing the W.
But there are complementary problems created by larger W.
• For Proteins – 20W possible word, Therefor when W increases number of words generated by query grows exponentially. (But number of words increases linearly with the length of the query)
• It increases time spent on step 1 & also the amount of memory required.
Results
BLAST
Optimal T and W values• For protein sequences W=4, T=17.• For DNA sequences W=11.
How W and T affect the performance of BLAST
Results
BLAST
T Execution Time Accuracy Speed
W Execution Time Accuracy Speed
Computational Complexity ?
Performance of BLAST with homologous sequences
Things to be Noted When Query Sequence BLAST Against Set of Homologous Sequences
RESULTS
What is homologous?• Related by common ancestor.
Researchers Example 1• Search for wooly monkey sequence.• When W=4 & T=17.• Found 178 MSPs with scores (50-80).• Random model suggest that BLAST should miss 24 of MSPs• But actual miss 43.• Therefor error = 44.2%
Researchers Example 2• Search for mouse sequences.• Same W & T as previous.• Found 33 MSPs with scores (45-65).• Random model suggest that BLAST should miss 8 of MSPs• But actual miss 2.• Therefor error = -300%
RESULTS
BLAST
• Failure to detect significant similarity does only shows our inability to detect homology, it does not prove that the sequences are not homologous.
• The overall performance of BLAST depends on the distribution of MSP scores.
Strengths of BLAST• Great utility is for identify high scoring MSPs quickly.• Takes lower amount of time for the alignment process.
Further improvements can be done• Novel approaches like Position Specific Iterated BLAST
(PSI BLAST)
RESULTS
BLAST
Comparison of two long DNA sequences
Does Adjustments of W Make the Process Faster?
RESULTS
Main Classes of Locally Similar Regions• Genes.• Long interspersed Repeats.• Anticipated Weaker Similarities.
Example (Human Gene VS Rabbit Gene)Step -1 • Match Score = 5, Mismatch Score = -4 & W=12.• 93 Alignments scoring over 200, 57 Alignments scoring
over 350 with 1301 highest score.Step -2• W=8• Only additional 32 alignments are found score over 200.
• Use of Smaller W does not provides new essential information always.
Results
BLAST
Results
BLAST
The Time & Sensitivity of
BLAST on DNA Sequences as
a Function of W
CONCLUSION
Underlying Concept of BLAST• Simple & Robust.• Can be implemented in many ways.• Can be used in variety of contexts.
Researchers Implementation• Used a shared memory version of BLAST.• Why shared memory?• Loads compressed DNA file into memory once & allow
subsequent steps to skip that step.
• BLAST approach permits construction of extremely fast programs for database searching, Which provides additional advantage on mathematical advantage
To Whom This Tool Would Help• Molecular Biologists• Doctors & etc…
Conclusion
BLAST
QUESTIONS & ANSWERS
Q&A
Q&A
• What are the disadvantages of Blast with compared to other heuristic algorithms? – by Pubudu• What is the criteria of selecting threshold S and T values using in
implementation part of this research paper? – by Parinda• What are the chances that the Maximal Segment Pair score for
two unrelated sequences would be greater than or equal to S value? – by Parinda• Among PAM and BLOSUM matrices what is the most suitable matrix
when scoring segments in amino acid sequences ? Why? – by Shashika
THANKS FOR YOUR TIME