PAM BLOSUM

04/10/23 1

Scoring Matrices, Database Searching and Heuristic Alignment Algorithms

ISSP 2081 / BIOINF 2051

Fall 2002

Lecture #6

2

Handouts on Weight Matrices

"Weight matrices for sequence similarity scoring" by David Wheeler
Supplement to the above by David Wheeler

3

PAM Matrices

First substitution matrices to be widely used
Based on the point-accepted-mutation (PAM) model of evolution (Dayhoff et al., 1978)
PAMs are relative measures of evolutionary distance
1 PAM = 1 accepted mutation per 100 amino acids (AAs)
Does this mean that after 100 PAMs every AA will be different? Why or why not?
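A minimal sketch of the question above, under a deliberately simplified assumption that is not the Dayhoff model: every site changes with probability 0.01 per PAM step, and a changed site picks uniformly among the other 19 amino acids. The function name and parameters are hypothetical. Because repeated and reverse changes hit the same site, many sites still match the ancestor after 100 PAMs.

```python
def expected_identity(n_pams, rate=0.01, alphabet_size=20):
    """Expected fraction of sites that still match the ancestor after n_pams steps.

    Toy uniform model (an assumption): each step changes a site with
    probability `rate`, choosing uniformly among the other residues.
    """
    k = alphabet_size
    # Per-step decay of the "excess identity" above the random level 1/k.
    decay = 1.0 - rate * k / (k - 1)
    return 1.0 / k + (1.0 - 1.0 / k) * decay ** n_pams


for n in (1, 100, 250):
    print(n, round(expected_identity(n), 3))
```

Under this toy model well over a third of the sites are still unchanged at 100 PAMs; the real Dayhoff rates give different numbers, but the same qualitative answer.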

4

PAM Matrices

If changes were purely random:
  Frequency of each possible substitution is proportional to the product of the background frequencies of the two amino acids
In related proteins:
  Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein’s function
  These point mutations are “accepted” during evolution
Log-odds approach:
  Scores proportional to the natural log of the ratio of target frequencies to background frequencies
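The log-odds idea can be sketched in a few lines. The function name and all frequencies below are hypothetical values chosen for illustration, not data from any real matrix: a pair seen more often than chance gets a positive score, a pair seen less often gets a negative one.

```python
import math


def log_odds(target_freq, bg_a, bg_b):
    """Natural log of observed pair frequency over the frequency expected by chance."""
    return math.log(target_freq / (bg_a * bg_b))


# Hypothetical numbers: both residues have background frequency 0.1,
# so a random pairing is expected with frequency 0.1 * 0.1 = 0.01.
favoured = log_odds(0.02, 0.1, 0.1)     # seen twice as often as chance
disfavoured = log_odds(0.005, 0.1, 0.1)  # seen half as often as chance
print(round(favoured, 3), round(disfavoured, 3))
```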

5

The Math

Score matrix entry for time t given by:

\[ s(a, b \mid t) = \log \frac{P(b \mid a, t)}{q_b} \]

P(b | a, t): conditional probability that a is substituted by b in time t
q_b: frequency of amino acid b

6

PAM Matrices Construction

Pairs of very closely related sequences used to collect mutation frequencies corresponding to 1 PAM
Explicit model
Two families studied: immunoglobulin, cytochrome c
Extrapolation of the data to a distance of 250 PAMs
  PAM250 was original Dayhoff matrix
Family of matrices: PAM10 … PAM200
  Matrix multiplication using PAM-1
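The extrapolation step can be sketched as repeated matrix multiplication. Everything below is a toy assumption: a made-up three-letter alphabet, a hand-picked one-step matrix standing in for PAM-1 (symmetric, so the background is uniform, with an average 1% change per step), and hypothetical helper names. Scores follow the log-odds form s(a,b|t) = log(P(b|a,t) / q_b).

```python
import math

ALPHABET = "ABC"  # hypothetical 3-letter alphabet, not real amino acids
# Toy stand-in for PAM-1: row a, column b holds P(b | a, 1 step).
PAM1 = [
    [0.988, 0.009, 0.003],
    [0.009, 0.988, 0.003],
    [0.003, 0.003, 0.994],
]
BACKGROUND = {letter: 1.0 / 3.0 for letter in ALPHABET}  # uniform for this toy matrix


def mat_mul(x, y):
    """Plain square-matrix product."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply the one-step matrix by itself `power` times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


def score(a, b, t):
    """Log-odds score for substituting a by b at distance t."""
    p = mat_pow(PAM1, t)[ALPHABET.index(a)][ALPHABET.index(b)]
    return math.log(p / BACKGROUND[b])


for pair in ("AA", "AB", "AC"):
    print(pair, round(score(pair[0], pair[1], 250), 3))
```

In this toy setup the frequent A/B exchange still scores positive at distance 250, while the rare A/C exchange scores negative.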

7

PAM Matrices: salient points

Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Does not take into account different evolutionary rates between conserved and non-conserved regions.

8

BLOSUM Matrices

Henikoff, S. & Henikoff, J.G. (1992)
Use blocks of protein sequence fragments from different families (the BLOCKS database)
Amino acid pair frequencies calculated by summing over all possible pairs in block
Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over particular threshold = same cluster)

9

BLOSUM Matrices

Similar idea to PAM matrices
Probabilities estimated from blocks of sequence fragments
Blocks represent structurally conserved regions

10

BLOSUM Matrices

Target frequencies are identified directly instead of by extrapolation.
Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
  BLOSUM50: >= 50% identity
  BLOSUM62: >= 62% identity
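A rough sketch of the clustering and pair-counting idea, under stated assumptions: the block is a made-up list of equal-length ungapped fragments, clustering is single-linkage at the identity threshold, and each cluster counts as one sequence by weighting its members by 1/(cluster size). The function names are hypothetical and this is not the Henikoffs' actual code.

```python
from collections import Counter
from itertools import combinations


def percent_identity(s1, s2):
    """Percent of aligned positions holding the same residue."""
    return 100.0 * sum(a == b for a, b in zip(s1, s2)) / len(s1)


def cluster(block, threshold):
    """Single-linkage clustering: fragments at or above the threshold share a cluster."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for c in clusters:
            if any(percent_identity(seq, member) >= threshold for member in c):
                linked.append(c)
            else:
                rest.append(c)
        clusters = rest + [[seq] + [m for c in linked for m in c]]
    return clusters


def pair_counts(block, threshold):
    """Count residue pairs column by column, between clusters only."""
    counts = Counter()
    for c1, c2 in combinations(cluster(block, threshold), 2):
        weight = 1.0 / (len(c1) * len(c2))  # each cluster acts as one sequence
        for s1 in c1:
            for s2 in c2:
                for a, b in zip(s1, s2):
                    counts[tuple(sorted((a, b)))] += weight
    return counts


block = ["AVLK", "AVLR", "GILK"]  # hypothetical block of fragments
print(cluster(block, 62))
print(dict(pair_counts(block, 62)))
```

At a threshold of 62 the first two fragments (75% identical) collapse into one cluster, so the K/R difference between them is never counted as a substitution within the cluster; only pairs against the third fragment contribute.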

11

BLOSUM Matrices: Salient points

Derived from local, ungapped alignments of distantly related sequences
All matrices are directly calculated; no extrapolations are used (no explicit model)
The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches (Proteins 17:49).

12

BLOSUM Example

PSC Tutorial - BLOSUM example
http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

13

Heuristic Alignment Algorithms

Database searching vs. sequence alignment
What is a heuristic? Why use heuristics?
Approximations to Smith-Waterman
  FASTA [Pearson & Lipman, 1988]
  BLAST [Altschul et al., 1990]
What are the tradeoffs in terms of search?
  Sensitivity vs. selectivity

14

BLAST Overview

BLAST heuristically finds maximal segment pairs (MSPs): the highest-scoring pair of identical-length segments from 2 sequences
  SP = segment pair, an ungapped local alignment
  MSP = a segment pair (SP) with maximum score over all segment pairs in S1 and S2

15

BLAST Overview

Given: query sequence q, word length w, word score threshold T, segment score threshold S
  Compile a list of “words” that score at least T when compared to words from q
  Scan database for matches to words in list
  Extend all matches to seek high-scoring segment pairs
Return: segment pairs scoring at least S

16

Determining Query Words

Given:
  Query sequence: QLNFSAGW
  Word length w = 2
  Word score threshold T = 8
Step 1: Determine all words of length w in query sequence
  QL LN NF FS SA AG GW
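Step 1 is a sliding window over the query. A minimal sketch (the function name is hypothetical):

```python
def query_words(query, w):
    """All overlapping words of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("QLNFSAGW", 2)
print(" ".join(words))
```

A query of length n yields n - w + 1 words, here 8 - 2 + 1 = 7.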

17

Determining Query Words

Step 2: Determine all words that score at least T when compared to a word in the query sequence
  QL: QL=11, QM=9, HL=8, ZL=9
  LN: LN=9, LB=8
  …
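A sketch of Step 2 for the word QL with T = 8. The pair scores below are toy values chosen only so that the QL row reproduces the numbers on the slide; they are an assumption, not taken from a named matrix, and the five-letter alphabet, the fallback score, and the function names are likewise hypothetical.

```python
from itertools import product

ALPHABET = "HLMQZ"  # reduced alphabet, just enough for the QL example
# Toy pair scores (assumed); keys are alphabetically sorted residue pairs.
TOY_SCORES = {
    ("Q", "Q"): 6, ("L", "L"): 5, ("L", "M"): 3,
    ("H", "Q"): 3, ("Q", "Z"): 4,
}
FALLBACK = -4  # assumed score for every pair not listed above


def pair_score(a, b):
    return TOY_SCORES.get(tuple(sorted((a, b))), FALLBACK)


def neighborhood(word, threshold, alphabet=ALPHABET):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits["".join(candidate)] = total
    return hits


print(neighborhood("QL", 8))
```

Enumerating every candidate costs (alphabet size)^w score evaluations per query word, which is one reason large w makes the word list expensive to generate.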

18

Scanning the database

Search database for all occurrences of query words
Approach:
  Build a DFA that recognizes all query words
  Run DB sequences through DFA
  Remember hits
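A simplified sketch of the scan. Note the swap: instead of building a DFA, this uses a dictionary lookup on each length-w window, which finds the same hits when all words have the same length. The index contents, the database string, and the function name are made up for illustration.

```python
def scan(db_seq, word_index, w):
    """Return (query_pos, db_pos) for every database window found in the word list.

    word_index maps each word to the query positions it was generated from.
    """
    hits = []
    for db_pos in range(len(db_seq) - w + 1):
        window = db_seq[db_pos:db_pos + w]
        for query_pos in word_index.get(window, []):
            hits.append((query_pos, db_pos))
    return hits


# Hypothetical index: neighbours of QL (query position 0) and LN (position 1).
word_index = {"QL": [0], "QM": [0], "HL": [0], "ZL": [0], "LN": [1]}
hits = scan("GHLNQM", word_index, 2)
print(hits)
```

Keeping the query position with each hit is what lets the extension step line up the two sequences.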

19

Finding MSPs

Extend hits in both directions (without allowing gaps) as long as score of segment pair increases
Return segment pairs scoring at least S
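A minimal sketch of the extension rule exactly as stated above: keep extending while each added residue pair raises the score. The match/mismatch scoring, the database string, and the function names are assumptions for illustration. BLAST's actual extension is more tolerant, allowing the score to dip by a bounded amount before giving up, which this sketch does not model.

```python
def toy_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def extend_hit(query, db_seq, q_pos, d_pos, w, score=toy_score):
    """Ungapped greedy extension of a length-w hit; returns (q_start, d_start, length, total)."""
    q_start, d_start, length = q_pos, d_pos, w
    total = sum(score(query[q_pos + i], db_seq[d_pos + i]) for i in range(w))
    # Extend to the right while the next pair increases the score.
    while (q_start + length < len(query) and d_start + length < len(db_seq)
           and score(query[q_start + length], db_seq[d_start + length]) > 0):
        total += score(query[q_start + length], db_seq[d_start + length])
        length += 1
    # Extend to the left under the same rule.
    while (q_start > 0 and d_start > 0
           and score(query[q_start - 1], db_seq[d_start - 1]) > 0):
        total += score(query[q_start - 1], db_seq[d_start - 1])
        q_start -= 1
        d_start -= 1
        length += 1
    return q_start, d_start, length, total


segment = extend_hit("QLNFSAGW", "AALNFSAKK", q_pos=2, d_pos=3, w=2)
print(segment)
```

The segment pair would then be reported only if its total reaches the threshold S.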

20

Choosing Values for w and T

Trade-off: sensitivity vs. running time
Choosing a value for w
  Small w: many matches to expand
  Big w: many words to be generated
  w = 4 is a good compromise
Choosing a value for T
  Small T: greater sensitivity, more matches to expand

21

BLAST Notes

May fail to find optimal MSPs
  May miss seeds if T is too stringent
  Extension is greedy
Empirically, 10 to 50 times faster than Smith-Waterman
Large impact: NCBI’s BLAST server handles more than 50,000 queries a day

22

Statistics of alignment scores (or how to choose a value for S)

[Karlin & Altschul, 1990]
A model of random sequences
  Ungapped alignments
  All residues drawn independently
  Expected score for a pair of randomly chosen residues required to be negative. Why?
See text for math
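One way to see the point of the negative-expectation requirement is to compute the expected score per random residue pair, the sum over a, b of p_a * p_b * s(a, b). The sketch below assumes uniform residue frequencies and two made-up scoring schemes; the names are hypothetical. If the expectation were positive, alignments of unrelated sequences would score higher simply by being longer.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}  # assumed frequencies


def expected_score(freqs, score):
    """Expected score of one randomly chosen residue pair."""
    return sum(pa * pb * score(a, b)
               for a, pa in freqs.items()
               for b, pb in freqs.items())


def penalising(a, b):
    """Hypothetical scheme: +5 match, -4 mismatch."""
    return 5 if a == b else -4


def non_penalising(a, b):
    """Hypothetical scheme: +1 match, 0 mismatch."""
    return 1 if a == b else 0


print(round(expected_score(UNIFORM, penalising), 4))      # negative: random extensions lose score
print(round(expected_score(UNIFORM, non_penalising), 4))  # positive: random extensions gain score
```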

23

FASTA

Heuristic, exclusion method
http://gcg.nhri.org.tw/fasta.html
See PSC tutorial for examples: www.cbmi.upmc.edu/~vanathi/syllabus.html

24

Readings for next class

FASTA
Summary for FASTA paper due
