BLAST: Basic Local Alignment Search Tool
Altschul et al. J. Mol. Biol. 1990.
CS 466
Saurabh Sinha
Motivation
• Sequence homology to a known protein suggests the function of a newly sequenced protein
• Bioinformatics task is to find homologous sequences in a database of sequences
• Databases of sequences are growing fast
Alignment
• Natural approach to check if the “query sequence” is homologous to a sequence in the database is to compute the alignment score of the two sequences
• Alignment score counts gaps (insertions, deletions) and replacements
• Minimizing the evolutionary distance
Alignment
• Global alignment: optimize the overall similarity of the two sequences
• Local alignment: find only relatively conserved subsequences
• Local similarity measures preferred for database searches
  – Distantly related proteins may only share isolated regions of similarity
Alignment
• Dynamic programming is the standard approach to sequence alignment
• Running time is quadratic: proportional to the product of the lengths of the two sequences
• Not practical for searches against a very large database of sequences (e.g., whole genome)
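To make the quadratic cost concrete, here is a minimal sketch of local alignment by dynamic programming (Smith-Waterman style). The match and mismatch values are the DNA scores used in these slides; the linear gap penalty of -6 is an assumption chosen only for illustration.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of a and b with a linear gap penalty.

    Fills a (len(a)+1) x (len(b)+1) table row by row, so the running
    time is proportional to len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Running this once per database sequence is what becomes impractical at database scale, which motivates the heuristic that follows.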
Scoring alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)
• Amino acid sequences: “PAM” matrix
  – Consider amino acid sequence alignment for very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances
• DNA sequences: match = +5, mismatch = -4
BLAST: the MSP
• Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues
• Maximal segment pair (MSP): highest scoring pair of identical-length segments from the two sequences
• The similarity score of an MSP is called the MSP score
• BLAST heuristically aims to find the MSP
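As a reference point for what BLAST approximates, the MSP can be found exactly by brute force. The sketch below uses the DNA scores from the slides (+5, -4); the function names are illustrative only, and this exhaustive search is not how BLAST works.

```python
def segment_score(a, b, match=5, mismatch=-4):
    """Ungapped similarity score of two equal-length segments."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def maximal_segment_pair(q, d, match=5, mismatch=-4):
    """Return (score, q_start, d_start, length) of the highest scoring
    pair of identical-length segments, by trying every start pair."""
    best = (0, 0, 0, 0)
    for qs in range(len(q)):
        for ds in range(len(d)):
            score = 0
            k = 0
            # Walk down the diagonal, keeping the best prefix seen so far.
            while qs + k < len(q) and ds + k < len(d):
                score += match if q[qs + k] == d[ds + k] else mismatch
                k += 1
                if score > best[0]:
                    best = (score, qs, ds, k)
    return best
```

For example, `maximal_segment_pair("TTACGTTT", "GGACGGG")` picks out the shared `ACG` starting at offset 2 in both sequences.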
Locally maximal segment pair
• A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest scoring pair
• A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction
• BLAST attempts to find all locally maximal segment pairs above some score cutoff.
Rapid approximation of MSP score
• Goal is to report those database sequences that have MSP score above some threshold S.
• Statistics tells us what is the highest threshold S at which “chance similarities” are likely to appear
• Tractability to statistical analysis is one of the attractive features of the MSP score
Rapid approximation of MSP score
• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.
• Main strategy: seek only segment pairs (one from database, one query) that contain a word pair with score >= T
• Intuition: If the sequence pair has to score above S, its most well matched word (of some predetermined small length) must score above T
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze
Implementation
1. Compile a list of high scoring words
2. Scan database for hits to this word list
3. Extend hits
Step 1: Compiling list of words from query sequence
• For proteins: List of all w-length words that score at least T when compared to some word in query sequence
• Question: Does every word in the query sequence make it to the list?
• For DNA: list of all w-length words in the query sequence, often with w=12
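A small sketch of Step 1 for both cases. The three-letter alphabet and the values in `TOY_SCORES` are toy assumptions chosen to keep the example readable; a real search would use a full 20 x 20 matrix such as PAM. Function names are illustrative.

```python
from itertools import product

# Toy symmetric similarity values over a 3-letter alphabet (assumption).
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("I", "L"): 2, ("K", "L"): -2, ("I", "K"): -3,
}


def s(a, b):
    """Look up the toy similarity of residues a and b."""
    return TOY_SCORES[tuple(sorted((a, b)))]


def protein_word_list(query, w, T, alphabet="ILK"):
    """Map every w-length word scoring >= T against some query word
    to the query offsets of the words it matches."""
    hits = {}
    for offset in range(len(query) - w + 1):
        qword = query[offset:offset + w]
        for cand in product(alphabet, repeat=w):
            if sum(s(x, y) for x, y in zip(cand, qword)) >= T:
                hits.setdefault("".join(cand), []).append(offset)
    return hits


def dna_word_list(query, w=12):
    """Map every w-length word occurring in the query to its offsets."""
    hits = {}
    for offset in range(len(query) - w + 1):
        hits.setdefault(query[offset:offset + w], []).append(offset)
    return hits
```

With these toy scores, `protein_word_list("LIK", 2, 9)` contains only `IK`: the query word `LI` scores 8 against itself, so not every query word makes it to the list. Enumerating all candidate words is exponential in w and is only meant to show the definition.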
Step 2: Scanning the database for hits
• Find exact matches to list words
• Can be done in linear time
  – two methods (next slides)
• Each word in the list (from the previous step) points to the query positions it matches
Scanning the database for hits
• Method 1: Let w=4, so 20^4 possible words
• Each integer in 0 … 20^4 - 1 is an index for an array
• Array element points to list of all occurrences of that word in query
• Not all 20^4 elements of array are populated
  – only the ones in word list from previous step
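Method 1 can be sketched as a direct-address table. The residue ordering in `AMINO_ACIDS` is an arbitrary assumption, the function names are illustrative, and the sketch assumes the database contains only these 20 letters.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 letters, order is arbitrary
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Read a word as a base-20 number in 0 .. 20**len(word) - 1."""
    idx = 0
    for aa in word:
        idx = idx * 20 + CODE[aa]
    return idx


def build_table(word_hits, w=4):
    """Array of size 20**w; only entries for list words are populated."""
    table = [None] * (20 ** w)
    for word, offsets in word_hits.items():
        table[word_index(word)] = offsets
    return table


def scan_with_table(database_seq, table, w=4):
    """Return (query_offset, database_offset) for every word hit."""
    hits = []
    for pos in range(len(database_seq) - w + 1):
        offsets = table[word_index(database_seq[pos:pos + w])]
        if offsets:
            hits.extend((q, pos) for q in offsets)
    return hits
```

Here the index is recomputed from scratch at every position for clarity; updating it incrementally as the window slides would keep the work per database residue constant.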
Scanning the database for hits
• Method 2: use a “deterministic finite automaton” or “finite state machine”.
• Similar to the keyword trees seen in the course.
• Build the finite state machine out of all words in word list from previous step
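One standard way to turn a keyword tree into a scanning machine is the Aho-Corasick construction. The sketch below assumes that construction; it may differ in detail from the automaton used in the BLAST implementation.

```python
from collections import deque


def build_automaton(words):
    """Keyword tree plus failure links (Aho-Corasick)."""
    goto = [{}]   # goto[state][char] -> next state
    out = [[]]    # out[state] -> words recognized on entering state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())       # depth-1 states fail to the root
    while queue:                          # breadth-first over the tree
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def scan_with_automaton(text, automaton):
    """Return (word, start_offset) for every occurrence of a list word."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((word, pos - len(word) + 1))
    return hits
```

Each database character is read once, so a scan takes time linear in the database length plus the number of hits reported.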
Step 3: Extending hits
• Once a word pair with score >= T has been found, extend it in each direction.
• Extend until score >= S is obtained
• During extension, score may go up, and then down, and then up again
• Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)
• One implementation allows gaps during extension
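A minimal sketch of ungapped extension with a drop-off rule. The parameter name `x_drop` and its default of 20 are assumptions for illustration; `score` is any residue-pair scoring function supplied by the caller.

```python
def extend_one_side(query, db, q, d, step, score, x_drop):
    """Walk the diagonal from (q, d) in direction step (+1 or -1).

    Returns (best gain, number of residue pairs kept). Stops at a
    sequence end or once the running score falls more than x_drop
    below the best score seen for a shorter extension.
    """
    best = running = 0
    kept = n = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        running += score(query[q], db[d])
        n += 1
        if running > best:
            best, kept = running, n
        elif best - running > x_drop:
            break
        q += step
        d += step
    return best, kept


def extend_hit(query, db, q_pos, d_pos, w, score, x_drop=20):
    """Extend a w-length word hit both ways.

    Returns (score, q_start, d_start, length) of the extended pair.
    """
    seed = sum(score(query[q_pos + k], db[d_pos + k]) for k in range(w))
    right, r_len = extend_one_side(query, db, q_pos + w, d_pos + w, 1, score, x_drop)
    left, l_len = extend_one_side(query, db, q_pos - 1, d_pos - 1, -1, score, x_drop)
    return seed + left + right, q_pos - l_len, d_pos - l_len, l_len + w + r_len
```

The caller would then compare the returned score with the cutoff S to decide whether to report the segment pair.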
BLAST: approximating the MSP
• BLAST may not find all segment pairs above threshold S
• Trying to approximate the MSP
• Bounds on the error: not hard bounds, but statistical bounds
  – “Highly likely” to find the MSP
Statistics
• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)
• Suppose this observed MSP scores S.
• What are the chances that the MSP score for two unrelated sequences would be >= S?
• If the chances are very low, then we can be confident that the two sequences must not have been unrelated
Statistics
• Given two random sequences of lengths m and n
• Probability that they will produce an MSP score of >= x ?
Statistics
• Number of separate SPs with score >= x is Poisson distributed with mean $y(x) = K m n \exp(-\lambda x)$, where
  – $\lambda$ is the positive solution of $\sum_{i,j} p_i p_j \exp(\lambda s(i,j)) = 1$
  – $K$ is a constant
  – $s(i,j)$ is the scoring matrix, $p_i$ is the frequency of $i$ in random sequences
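The equation for λ has no closed form in general but is easy to solve numerically. The sketch below uses bisection; the bracket `[lo, hi]` and the function names are assumptions, and K is left as a caller-supplied parameter because the slides only say it is a constant.

```python
import math


def solve_lambda(freqs, score, lo=1e-6, hi=10.0, iters=200):
    """Positive root of sum_{i,j} p_i p_j exp(lambda * s(i,j)) = 1.

    freqs maps each residue to its background frequency p_i.
    Requires a negative expected score and at least one positive score,
    so that the function changes sign on [lo, hi].
    """
    def f(lam):
        total = 0.0
        for i, p_i in freqs.items():
            for j, p_j in freqs.items():
                total += p_i * p_j * math.exp(lam * score(i, j))
        return total - 1.0

    if not (f(lo) < 0.0 < f(hi)):
        raise ValueError("no sign change on [lo, hi]")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_sp_count(K, m, n, lam, x):
    """Poisson mean y(x) = K * m * n * exp(-lambda * x)."""
    return K * m * n * math.exp(-lam * x)
```

For uniform DNA frequencies with match = +5 and mismatch = -4, the root comes out at roughly 0.19.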
Statistics
• Poisson distribution with mean $y$: $\Pr(k) = e^{-y} y^{k} / k!$
• $\Pr(\#\mathrm{SPs} \ge \alpha) = 1 - \Pr(\#\mathrm{SPs} \le \alpha - 1) = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \frac{y^{i}}{i!}$
Statistics
• For α=1, $\Pr(\#\mathrm{SPs} \ge 1) = 1 - e^{-y(x)}$
• Choose S such that $1 - e^{-y(S)}$ is small
• Suppose the probability of having at least 1 SP with score >= S is 0.001.
• This seems reasonably small
• However, if you test 10000 random sequences, you expect 10 to cross the threshold
• Therefore, require “E-value” to be small.
• That is, expected number of random sequence pairs with score >= S should be small.
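The arithmetic on this slide can be checked with a few lines. The function names are illustrative; `y` stands for the Poisson mean y(S).

```python
import math


def prob_at_least(alpha, y):
    """Pr(#SPs >= alpha) when the number of SPs is Poisson with mean y."""
    tail_complement = sum(y ** i / math.factorial(i) for i in range(alpha))
    return 1.0 - math.exp(-y) * tail_complement


def expected_chance_hits(p_single, num_sequences):
    """Expected number of random sequences crossing the threshold."""
    return p_single * num_sequences
```

With a per-sequence probability of 0.001 and 10000 database sequences, `expected_chance_hits(0.001, 10000)` gives the 10 chance hits mentioned above, which is why the threshold is set on the expected count rather than on the single-comparison probability.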
More statistics
• We just saw how to choose threshold S
• How to choose T ?
• BLAST is trying to find segment pairs (SPs) scoring above S
• If an SP scores S, what is the probability that it will have a w-word match of score T or more?
• We want this probability to be high
More statistics: Choosing T
• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring at least T?
• Want this q to be low
• Obtained from simulations
• Found to decrease exponentially as S increases
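The building block of such a simulation is a check for whether a segment pair contains a qualifying word pair. The sketch below covers only that check and the resulting fraction; how the sample of segment pairs scoring S is generated is left to the caller, and the function names are illustrative.

```python
def has_word_hit(seg_a, seg_b, w, T, score):
    """True if some aligned w-word pair inside the segment pair scores >= T."""
    pair_scores = [score(x, y) for x, y in zip(seg_a, seg_b)]
    return any(
        sum(pair_scores[k:k + w]) >= T
        for k in range(len(pair_scores) - w + 1)
    )


def fraction_missed(segment_pairs, w, T, score):
    """Estimate q: the fraction of segment pairs with no word hit."""
    missed = sum(
        1 for a, b in segment_pairs if not has_word_hit(a, b, w, T, score)
    )
    return missed / len(segment_pairs)
```

Feeding in many simulated segment pairs that all score S, for a range of S values, would give the empirical curve of q against S.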
BLAST is the universally used bioinformatics tool
http://flybase.org/blast/