BLAST: Basic Local Alignment Search Tool...

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Page 1: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS

BLAST: Basic Local Alignment Search Tool

Altschul et al. J. Mol. Biol. 1990.

CS 466
Saurabh Sinha

Motivation

• Sequence homology to a known protein suggests the function of a newly sequenced protein

• The bioinformatics task is to find homologous sequences in a database of sequences

• Databases of sequences are growing fast

Alignment

• A natural approach to checking whether the “query sequence” is homologous to a sequence in the database is to compute the alignment score of the two sequences

• Alignment score counts gaps (insertions, deletions) and replacements

• Minimizing the evolutionary distance

Alignment

• Global alignment: optimize the overall similarity of the two sequences

• Local alignment: find only relatively conserved subsequences

• Local similarity measures preferred for database searches
  – Distantly related proteins may only share isolated regions of similarity

Alignment

• Dynamic programming is the standard approach to sequence alignment

• The algorithm is quadratic: its running time grows with the product of the lengths of the two sequences

• Not practical for searches against a very large database of sequences (e.g., a whole genome)
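
A minimal sketch of the quadratic dynamic program, written in the style of Smith-Waterman local alignment (the slides do not name a specific algorithm). The match and mismatch values are the +5/-4 DNA scores used in this lecture; the linear gap penalty of -6 and the function name are assumptions for illustration only.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of a and b by dynamic programming.

    Fills an (len(a)+1) x (len(b)+1) table one row at a time, so the
    running time is proportional to len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is visited once, which is why running this against each sequence in a large database becomes too slow.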

Scoring alignments

• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)

• Amino acid sequences: “PAM” matrix
  – Consider amino acid sequence alignments for very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances

• DNA sequences: match = +5, mismatch = -4

BLAST: the MSP

• Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues

• Maximal segment pair (MSP): highest scoring pair of identical-length segments from the two sequences

• The similarity score of an MSP is called the MSP score

• BLAST heuristically aims to find the MSP
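
For reference, the MSP can be computed exactly (and slowly) by scanning every diagonal of the two sequences for its best-scoring run. The sketch below assumes the +5/-4 DNA scores from this lecture; the function name and return layout are hypothetical. This is the exhaustive computation that BLAST approximates, not BLAST itself.

```python
def msp_score(q, d, match=5, mismatch=-4):
    """Return (score, q_start, d_start, length) of the maximal segment pair.

    A segment pair is ungapped, so it lies on one diagonal. On each
    diagonal, a running sum that is reset whenever it drops to zero or
    below finds the best-scoring contiguous run.
    """
    best = (0, 0, 0, 0)
    # offset = (position in d) - (position in q) identifies a diagonal
    for offset in range(-(len(q) - 1), len(d)):
        i = max(0, -offset)   # first position of the diagonal in q
        j = max(0, offset)    # first position of the diagonal in d
        run, run_start, k = 0, 0, 0
        while i + k < len(q) and j + k < len(d):
            if run <= 0:
                run, run_start = 0, k   # start a new candidate segment
            run += match if q[i + k] == d[j + k] else mismatch
            if run > best[0]:
                best = (run, i + run_start, j + run_start, k - run_start + 1)
            k += 1
    return best
```

For example, `msp_score("ACGTT", "GGACGTA")` returns `(20, 0, 2, 4)`: the four matching residues ACGT, starting at position 0 in the first sequence and position 2 in the second.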

Locally maximal segment pair

• A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest scoring pair

• A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction

• BLAST attempts to find all locally maximal segment pairs above some score cutoff.

Rapid approximation of MSP score

• Goal is to report those database sequences that have an MSP score above some threshold S.

• Statistics tells us the highest threshold S at which “chance similarities” are still likely to appear

• Tractability to statistical analysis is one of the attractive features of the MSP score

Rapid approximation of MSP score

• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.

• Main strategy: seek only segment pairs (one segment from the database, one from the query) that contain a word pair with score >= T

• Intuition: if the sequence pair is to score above S, its best-matched word (of some predetermined small length) must score at least T

• Lower T => fewer false negatives

• Lower T => more pairs to analyze

Implementation

1. Compile a list of high scoring words
2. Scan database for hits to this word list
3. Extend hits

Step 1: Compiling list of words from query sequence

• For proteins: list of all w-length words that score at least T when compared to some word in the query sequence

• Question: Does every word in the query sequence make it to the list?

• For DNA: list of all w-length words in the query sequence, often with w=12
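
Step 1 can be sketched as below. The function names are hypothetical, and `toy_score` is a simple match/mismatch function standing in for a real substitution matrix such as PAM, which is not reproduced here. The enumeration tries every possible word of length w at every query position, which is fine for a small illustration but is not meant to reflect how BLAST builds its list efficiently.

```python
from itertools import product

def neighborhood_words(query, w, T, alphabet, score):
    """Protein-style list: every w-length word that scores >= T against
    some w-length word of the query, mapped to the query positions where
    that happens."""
    words = {}
    for pos in range(len(query) - w + 1):
        qword = query[pos:pos + w]
        for cand in product(alphabet, repeat=w):
            s = sum(score(a, b) for a, b in zip(qword, cand))
            if s >= T:
                words.setdefault("".join(cand), []).append(pos)
    return words

def exact_words(query, w=12):
    """DNA-style list: just the w-length words of the query itself."""
    words = {}
    for pos in range(len(query) - w + 1):
        words.setdefault(query[pos:pos + w], []).append(pos)
    return words

def toy_score(a, b):
    # Assumption: +5 for a match, -4 for a mismatch, in place of PAM.
    return 5 if a == b else -4

word_list = neighborhood_words("ACGT", w=3, T=6, alphabet="ACGT", score=toy_score)
```

With these toy scores a 3-letter query word scores 15 against itself and 6 against any word that differs in one position, so T=6 admits the query words and their one-mismatch neighbors. Raising T above 15 leaves the list empty, which bears on the question above: a query word is listed only if its score against itself reaches T.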

Step 2: Scanning the database for hits

• Find exact matches to list words

• Can be done in linear time
  – two methods (next slides)

• Each word in list points to all occurrences of the word in word list from previous step

Scanning the database for hits

• Method 1: Let w=4, so $20^4$ possible words

• Each integer in 0 … $20^4 - 1$ is an index for an array

• Array element points to list of all occurrences of that word in query

• Not all $20^4$ elements of array are populated
  – only the ones in word list from previous step
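
Method 1 amounts to reading each word as a base-20 number and using it as an array index. The following is a sketch under stated assumptions: the letter ordering in `AMINO`, the function names, and the (query position, database position) hit format are all illustrative choices, and database letters outside the 20 standard amino acids are not handled.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"               # assumed ordering of the 20 residues
CODE = {aa: k for k, aa in enumerate(AMINO)}

def word_index(word):
    """Read a word as a base-20 integer in 0 .. 20**len(word) - 1."""
    idx = 0
    for ch in word:
        idx = idx * 20 + CODE[ch]
    return idx

def build_table(word_list, w=4):
    """word_list maps each listed word to its query positions.
    Only the slots for listed words are populated; the rest stay None."""
    table = [None] * (20 ** w)
    for word, positions in word_list.items():
        table[word_index(word)] = positions
    return table

def scan_with_table(database_seq, table, w=4):
    """Slide a window of length w over the database sequence and look
    each window up in the table: one pass over the sequence."""
    hits = []
    for d_pos in range(len(database_seq) - w + 1):
        positions = table[word_index(database_seq[d_pos:d_pos + w])]
        if positions:
            hits.extend((q_pos, d_pos) for q_pos in positions)
    return hits

table = build_table({"ACDE": [0], "CDEF": [1]})
hits = scan_with_table("WWACDEFWW", table)
```

Here `hits` is `[(0, 2), (1, 3)]`. The sketch recomputes the index for every window; updating it incrementally as the window slides would avoid the inner loop.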

Scanning the database for hits

• Method 2: use a “deterministic finite automaton” or “finite state machine”.

• Similar to the keyword trees seen in the course.

• Build the finite state machine out of all words in the word list from the previous step
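
One way to realize Method 2 is a keyword tree with failure links, in the style of the Aho-Corasick algorithm. This is a swapped-in, textbook construction chosen for illustration; it is not necessarily the automaton that BLAST itself builds, and all names below are hypothetical.

```python
from collections import deque

def build_automaton(words):
    goto = [{}]    # goto[state][char] -> next state (the keyword tree)
    fail = [0]     # failure link of each state
    out = [[]]     # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states fail to the root (state 0).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def scan_with_automaton(text, automaton):
    """Feed the text through the automaton once; report (start, word)."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits

automaton = build_automaton(["ACD", "CDE", "DEF"])
found = scan_with_automaton("XACDEFX", automaton)
```

Here `found` is `[(1, "ACD"), (2, "CDE"), (3, "DEF")]`. Each database character is consumed once, which is what makes the scan linear in the database length.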

Step 3: Extending hits

• Once a word pair with score >= T has been found, extend it in each direction.

• Extend until score >= S is obtained

• During extension, score may go up, and then down, and then up again

• Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)

• One implementation allows gaps during extension

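
The ungapped extension with a drop-off rule might look like the sketch below. The parameter name `X` for the allowed drop, its default of 20, and the +5/-4 scores are assumptions for illustration. The sketch extends as far as the drop-off rule allows and returns the best segment pair found; comparing its score against S is left to the caller.

```python
def _extend(q, d, i, j, step, X, sub):
    """Walk from (i, j) in direction step (+1 or -1).
    Return (best score gain, number of residues kept)."""
    best, run, kept, k = 0, 0, 0, 0
    while 0 <= i < len(q) and 0 <= j < len(d):
        run += sub(q[i], d[j])
        k += 1
        if run > best:
            best, kept = run, k
        elif best - run > X:
            break   # fell too far below the best score seen so far
        i += step
        j += step
    return best, kept

def extend_hit(q, d, q_pos, d_pos, w, X=20, match=5, mismatch=-4):
    """Extend the word hit q[q_pos:q_pos+w] / d[d_pos:d_pos+w] both ways.
    Return (score, q_start, d_start, length) of the extended pair."""
    def sub(a, b):
        return match if a == b else mismatch

    seed = sum(sub(q[q_pos + k], d[d_pos + k]) for k in range(w))
    right, n_right = _extend(q, d, q_pos + w, d_pos + w, +1, X, sub)
    left, n_left = _extend(q, d, q_pos - 1, d_pos - 1, -1, X, sub)
    return (seed + left + right, q_pos - n_left, d_pos - n_left,
            w + n_left + n_right)
```

For example, `extend_hit("AAAATAAAA", "AAAACAAAA", 0, 0, 4)` returns `(36, 0, 0, 9)`: the score dips at the mismatch and then recovers. With `X=3` the same call stops at the mismatch and returns `(20, 0, 0, 4)`.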

BLAST: approximating the MSP

• BLAST may not find all segment pairs above threshold S

• Trying to approximate the MSP

• Bounds on the error: not hard bounds, but statistical bounds
  – “Highly likely” to find the MSP

Statistics

• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)

• Suppose this observed MSP scores S.

• What are the chances that the MSP score for two unrelated sequences would be >= S?

• If the chances are very low, then we can be confident that the two sequences must not have been unrelated

Statistics

• Given two random sequences of lengths m and n

• What is the probability that they will produce an MSP score >= x?

Statistics

• Number of separate SPs with score >= x is Poisson distributed with mean $y(x) = K m n \, e^{-\lambda x}$, where

• λ is the positive solution of $\sum_{i,j} p_i \, p_j \, e^{\lambda s(i,j)} = 1$

• K is a constant

• s(i,j) is the scoring matrix, $p_i$ is the frequency of i in random sequences
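
As a numerical illustration, λ can be found by bisection, since the left-hand side of its defining equation dips below 1 just above λ = 0 (when the expected score per aligned pair is negative) and then rises through 1 exactly once. The sketch assumes uniform base frequencies and the +5/-4 DNA scores from this lecture; the function names are hypothetical, and K is left as an input because the slides do not give its value.

```python
import math

def solve_lambda(p, s, hi=10.0, tol=1e-12):
    """Positive solution of sum_ij p_i p_j exp(lambda * s(i, j)) = 1.

    p maps each letter to its frequency, s(i, j) is the score of
    aligning i with j. Assumes the expected score is negative and
    that the root lies below hi.
    """
    def f(lam):
        return sum(p[i] * p[j] * math.exp(lam * s(i, j))
                   for i in p for j in p) - 1.0

    lo = 1e-9   # just above the trivial root at lambda = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_sp_count(x, m, n, lam, K):
    """y(x) = K * m * n * exp(-lambda * x)."""
    return K * m * n * math.exp(-lam * x)

uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, lambda a, b: 5 if a == b else -4)  # about 0.19
```

Because y(x) decays exponentially in x, raising the score threshold by a fixed amount divides the expected number of chance segment pairs by a fixed factor.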

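Statistics

• Poisson distribution: $\Pr(x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$ (here λ denotes the Poisson mean; for the number of SPs the mean is y)

• $\Pr(\#\mathrm{SPs} \ge \alpha) = 1 - \Pr(\#\mathrm{SPs} \le \alpha - 1) = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \dfrac{y^{i}}{i!}$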

Statistics

• For α=1, $\Pr(\#\mathrm{SPs} \ge 1) = 1 - e^{-y(x)}$

• Choose S such that $1 - e^{-y(S)}$ is small

• Suppose the probability of having at least 1 SP with score >= S is 0.001.

• This seems reasonably small

• However, if you test 10000 random sequences, you expect 10 to cross the threshold

• Therefore, require “E-value” to be small.

• That is, expected number of random sequence pairs with score >= S should be small.
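
The arithmetic behind these bullets can be checked with a few lines. The function names are hypothetical; the 0.001 probability and the 10000 sequences are the figures from the slide.

```python
import math

def prob_at_least(alpha, y):
    """Pr(#SPs >= alpha) when the number of SPs is Poisson with mean y."""
    below = sum(y ** i / math.factorial(i) for i in range(alpha))
    return 1.0 - math.exp(-y) * below

def expected_chance_hits(p_single, num_sequences):
    """Expected number of random sequences that cross the threshold when
    each one does so independently with probability p_single."""
    return p_single * num_sequences

# Mean y for which Pr(#SPs >= 1) equals 0.001.
y = -math.log(1 - 0.001)
p = prob_at_least(1, y)
e = expected_chance_hits(p, 10000)
```

Here `p` comes out at 0.001 and `e` at about 10, matching the slide: a per-comparison probability that looks small still yields several chance hits once the database is large.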

More statistics

• We just saw how to choose threshold S

• How to choose T?

• BLAST is trying to find segment pairs (SPs) scoring above S

• If an SP scores S, what is the probability that it will have a w-word match of score T or more?

• We want this probability to be high

More statistics: Choosing T

• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring T or more?

• Want this q to be low

• q is obtained from simulations

• q is found to decrease exponentially as S increases

BLAST is the universally used bioinformatics tool

http://flybase.org/blast/