PAM BLOSUM

06/09/22 1 Scoring Matrices, Database Searching and Heuristic Alignment Algorithms ISSP 2081 / BIOINF 2051 Fall 2002 Lecture #6

Transcript of PAM BLOSUM

Page 1: PAM BLOSUM

04/10/23 1

Scoring Matrices, Database Searching and Heuristic Alignment Algorithms

ISSP 2081 / BIOINF 2051

Fall 2002

Lecture #6

Page 2: PAM BLOSUM

Handouts on Weight Matrices
Weight matrices for sequence similarity scoring, by David Wheeler
Supplement to above, by David Wheeler

Page 3: PAM BLOSUM

PAM Matrices
First substitution matrices to be widely used
Based on the point accepted mutation (PAM) model of evolution (Dayhoff et al., 1978)
PAMs are relative measures of evolutionary distance
1 PAM = 1 accepted mutation per 100 AAs
Does this mean that after 100 PAMs every AA will be different? Why or why not?

Page 4: PAM BLOSUM

PAM Matrices
If changes were purely random:
  Frequency of each possible substitution is proportional to the background frequencies
In related proteins:
  Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein’s function
  These point mutations are “accepted” during evolution
Log-odds approach:
  Scores are proportional to the natural log of the ratio of target frequencies to background frequencies

Page 5: PAM BLOSUM

The Math
Score matrix entry for time t is given by:

$$ s(a, b \mid t) = \log \frac{P(b \mid a, t)}{q_b} $$

where P(b | a, t) is the conditional probability that a is substituted by b in time t, and q_b is the frequency of amino acid b.
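
A minimal sketch of the log-odds score above. The probabilities are made-up illustration values, not taken from any real PAM matrix, and the function name is hypothetical.

```python
import math


def log_odds_score(p_b_given_a: float, q_b: float) -> float:
    """Natural-log odds score: log(P(b|a,t) / q_b)."""
    return math.log(p_b_given_a / q_b)


# Hypothetical numbers, for illustration only.
conserved = log_odds_score(p_b_given_a=0.20, q_b=0.05)  # b replaces a more often than chance
unlikely = log_odds_score(p_b_given_a=0.01, q_b=0.05)   # b replaces a less often than chance
neutral = log_odds_score(p_b_given_a=0.05, q_b=0.05)    # exactly chance level

print(conserved, unlikely, neutral)
```

Substitutions seen more often than the background predicts get positive scores, rarer ones get negative scores, and chance-level ones score zero.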

Page 6: PAM BLOSUM

PAM Matrices: Construction
Pairs of very closely related sequences used to collect mutation frequencies corresponding to 1 PAM
Explicit model
Two families studied: immunoglobulin, cytochrome C
Extrapolation of the data to a distance of 250 PAMs
  PAM250 was the original Dayhoff matrix
Family of matrices: PAM10… PAM200
  Matrix multiplication using PAM-1
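
The extrapolation step can be sketched as repeated matrix multiplication. This is a toy example on a hypothetical 3-letter alphabet with invented probabilities, where each row changes with probability 0.01, mimicking 1 PAM. It is not the Dayhoff data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m multiplied by itself n times (identity for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM mutation probability matrix; row i gives P(j | i).
pam1_toy = [
    [0.990, 0.006, 0.004],
    [0.007, 0.990, 0.003],
    [0.002, 0.008, 0.990],
]

pam100_toy = mat_power(pam1_toy, 100)

# Diagonal entries: probability that a residue is the same after 100 PAMs.
unchanged = [pam100_toy[i][i] for i in range(3)]
print(unchanged)
```

Even in this toy model a sizeable fraction of positions is unchanged after 100 steps, because repeated and reverse mutations hit the same positions. Score matrices would then be derived from the extrapolated probabilities by the log-odds formula.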

Page 7: PAM BLOSUM

PAM Matrices: Salient points
Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Does not take into account different evolutionary rates between conserved and non-conserved regions.

Page 8: PAM BLOSUM

BLOSUM Matrices
Henikoff, S. & Henikoff, J.G. (1992)
Use blocks of protein sequence fragments from different families (the BLOCKS database)
Amino acid pair frequencies calculated by summing over all possible pairs in each block
Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over a particular threshold = same cluster)
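
A small sketch of the pair-counting step, assuming a block is a list of equal-length, ungapped fragments. The block below is invented for illustration, and clustering is ignored here.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


toy_block = ["AVL", "AVL", "AIL", "SVM"]  # hypothetical block of 4 fragments

pair_counts = count_pairs(toy_block)
total = sum(pair_counts.values())
pair_freqs = {pair: n / total for pair, n in pair_counts.items()}

print(pair_counts)
print(pair_freqs)
```

With 4 sequences each column contributes 6 pairs, so the 3 columns give 18 pairs in total. The observed pair frequencies play the role of the target frequencies.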

Page 9: PAM BLOSUM

BLOSUM Matrices
Similar idea to PAM matrices
Probabilities estimated from blocks of sequence fragments
Blocks represent structurally conserved regions

Page 10: PAM BLOSUM

BLOSUM Matrices
Target frequencies are identified directly instead of by extrapolation.
Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
  BLOSUM50: >= 50% identity
  BLOSUM62: >= 62% identity
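
A rough sketch of the grouping step. It assumes single-linkage clustering (a sequence joins a cluster if it reaches the threshold against any member) and uses ">=" for the cutoff; the exact procedure used for the real matrices may differ. Sequences are invented.

```python
def percent_identity(s1, s2):
    """Percent of identical positions between two equal-length fragments."""
    matches = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * matches / len(s1)


def cluster(block, threshold):
    """Group fragments that reach the identity threshold against any cluster member."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for c in clusters:
            hit = any(percent_identity(seq, member) >= threshold for member in c)
            (linked if hit else rest).append(c)
        # Merge the new sequence with every cluster it links to.
        merged = [seq] + [member for c in linked for member in c]
        clusters = rest + [merged]
    return clusters


toy_block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEWGHIRV", "MNPQRSTVWY"]  # hypothetical

print(cluster(toy_block, 80))  # first three fragments end up in one cluster
print(cluster(toy_block, 90))  # only the first two are grouped
```

A higher threshold leaves more clusters, so more closely related pairs contribute to the counts. That is why greater BLOSUM numbers correspond to lesser distances.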

Page 11: PAM BLOSUM

BLOSUM Matrices: Salient points
Derived from local, ungapped alignments of distantly related sequences
All matrices are directly calculated; no extrapolations are used, and there is no explicit model
The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches (Proteins 17:49).

Page 12: PAM BLOSUM

BLOSUM Example
PSC Tutorial - BLOSUM example

http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html

Page 13: PAM BLOSUM

Heuristic Alignment Algorithms
Database searching vs. sequence alignment
What is a heuristic? Why use heuristics?
Approximations to Smith-Waterman
  FASTA [Pearson & Lipman, 1988]
  BLAST [Altschul et al., 1990]
What are the tradeoffs in terms of search?
  Sensitivity vs. selectivity

Page 14: PAM BLOSUM

BLAST Overview
BLAST heuristically finds maximal segment pairs: the highest-scoring pair of identical-length segments from 2 sequences
SP = ungapped, local alignment
MSP = a segment pair (SP) with maximum score over all segment pairs in S1 and S2

Page 15: PAM BLOSUM

BLAST Overview
Given: query sequence q, word length w, word score threshold T, segment score threshold S
  Compile a list of “words” that score at least T when compared to words from q
  Scan database for matches to words in list
  Extend all matches to seek high-scoring segment pairs
Return: segment pairs scoring at least S

Page 16: PAM BLOSUM

Determining Query Words
Given:
  Query sequence: QLNFSAGW
  Word length w = 2
  Word score threshold T = 8
Step 1: Determine all words of length w in query sequence
  QL LN NF FS SA AG GW
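
Step 1 is a sliding window over the query. A minimal sketch (the function name is hypothetical):

```python
def query_words(query: str, w: int) -> list:
    """Return all overlapping words of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("QLNFSAGW", 2)
print(words)  # ['QL', 'LN', 'NF', 'FS', 'SA', 'AG', 'GW']
```

A query of length n yields n - w + 1 words, here 8 - 2 + 1 = 7.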

Page 17: PAM BLOSUM

Determining Query Words
Step 2: Determine all words that score at least T when compared to a word in the query sequence
  QL: QL=11, QM=9, HL=8, ZL=9
  LN: LN=9, LB=8
  ….
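
Step 2 can be sketched by brute force over all candidate words. The scoring scheme below is a made-up stand-in (identity = 5, a few "similar" pairs = 3, everything else = -2), so its numbers will not match the slide's values, which come from a real substitution matrix.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical set of "similar" residue pairs, for illustration only.
SIMILAR = {frozenset("LM"), frozenset("QE"), frozenset("ND")}


def toy_score(a: str, b: str) -> int:
    """Toy substitution score; a real search would look up PAM or BLOSUM."""
    if a == b:
        return 5
    if frozenset((a, b)) in SIMILAR:
        return 3
    return -2


def neighborhood(word: str, threshold: int) -> dict:
    """Return every candidate word scoring at least threshold against word."""
    hits = {}
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, cand))
        if score >= threshold:
            hits["".join(cand)] = score
    return hits


print(neighborhood("QL", 8))
```

Enumerating all 20^w candidates is fine for small w; it also shows why a big w means many words to generate.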

Page 18: PAM BLOSUM

Scanning the database
Search database for all occurrences of query words
Approach:
  Build a DFA that recognizes all query words
  Run DB sequences through DFA
  Remember hits
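
A simplified sketch of the scan. Note the swap: instead of building a DFA, this uses a set lookup on each window of the database sequence, which finds the same hits for fixed-length words. The sequences and word list are invented.

```python
def scan(db_seq: str, word_list, w: int):
    """Return (position, word) for every occurrence of a listed word in db_seq."""
    words = set(word_list)
    hits = []
    for i in range(len(db_seq) - w + 1):
        window = db_seq[i:i + w]
        if window in words:
            hits.append((i, window))  # remember the hit
    return hits


hits = scan("MAQLNWQMSA", ["QL", "QM", "LN", "SA"], 2)
print(hits)  # [(2, 'QL'), (3, 'LN'), (6, 'QM'), (8, 'SA')]
```

A DFA reaches the same result in a single pass without re-reading overlapping windows, which matters when the database is large.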

Page 19: PAM BLOSUM

Finding MSPs
Extend hits in both directions (without allowing gaps) as long as score of segment pair increases
Return segment pairs scoring at least S
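
A sketch of the extension exactly as stated on the slide: keep extending while each added residue pair increases the score. The scoring values are hypothetical. Practical implementations usually tolerate a limited temporary drop in score rather than stopping at the first non-improving pair.

```python
def pair_score(a: str, b: str) -> int:
    """Toy score: hypothetical match/mismatch values."""
    return 5 if a == b else -4


def extend_hit(q: str, d: str, qi: int, di: int, w: int):
    """Extend the word hit q[qi:qi+w] / d[di:di+w] without gaps.

    Returns (score, query_start, db_start, length).
    """
    score = sum(pair_score(a, b) for a, b in zip(q[qi:qi + w], d[di:di + w]))
    left, right = 0, w  # segment covers offsets [left, right) from the hit start

    # Extend to the right while the next pair increases the score.
    while (qi + right < len(q) and di + right < len(d)
           and pair_score(q[qi + right], d[di + right]) > 0):
        score += pair_score(q[qi + right], d[di + right])
        right += 1

    # Extend to the left in the same way.
    while (qi + left - 1 >= 0 and di + left - 1 >= 0
           and pair_score(q[qi + left - 1], d[di + left - 1]) > 0):
        score += pair_score(q[qi + left - 1], d[di + left - 1])
        left -= 1

    return score, qi + left, di + left, right - left


query, db_seq = "QLNFSAGW", "MAQLNFTAGW"
print(extend_hit(query, db_seq, 0, 2, 2))  # (20, 0, 2, 4)
```

Only segment pairs whose final score is at least S would then be reported.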

Page 20: PAM BLOSUM

Choosing Values for w and T
Trade-off: sensitivity vs. running time
Choosing a value for w
  Small w: many matches to expand
  Big w: many words to be generated
  w = 4 is a good compromise
Choosing a value for T
  Small T: greater sensitivity, more matches to expand

Page 21: PAM BLOSUM

BLAST Notes
May fail to find optimal MSPs
  May miss seeds if T is too stringent
  Extension is greedy
Empirically, 10 to 50 times faster than Smith-Waterman
Large impact: NCBI’s BLAST server handles more than 50,000 queries a day

Page 22: PAM BLOSUM

Statistics of alignment scores (or how to choose a value for S)
[Karlin & Altschul, 1990]
A model of random sequences
  Ungapped alignments
  All residues drawn independently
  Expected score for a pair of randomly chosen residues required to be negative. Why?
See text for math
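
One way to check the negativity requirement for a given scoring scheme is to compute the expected score of a random residue pair directly. The alphabet, frequencies, and scores below are invented for illustration.

```python
def expected_score(freqs: dict, score) -> float:
    """Expected score of a pair of independently drawn residues."""
    return sum(freqs[a] * freqs[b] * score(a, b) for a in freqs for b in freqs)


toy_freqs = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical background frequencies


def strict_score(a: str, b: str) -> int:
    return 2 if a == b else -3


def lenient_score(a: str, b: str) -> int:
    return 2 if a == b else -1


e_strict = expected_score(toy_freqs, strict_score)    # negative: satisfies the requirement
e_lenient = expected_score(toy_freqs, lenient_score)  # positive: violates it
print(e_strict, e_lenient)
```

With a positive expectation, extending an alignment between unrelated sequences would tend to raise its score, so long random alignments would score highly by chance alone.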

Page 23: PAM BLOSUM

FASTA
Heuristic, exclusion method
http://gcg.nhri.org.tw/fasta.html
See PSC tutorial for examples:

www.cbmi.upmc.edu/~vanathi/syllabus.html

Page 24: PAM BLOSUM

Readings for next class
FASTA
Summary for FASTA paper due