Calculating substitution matrices

• http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#wm5

Two models for sequence alignment: one random (R) and one match (M)

The random model assumes that letter a occurs independently with some frequency $q_a$, so the probability of the two sequences is just the product of the probabilities of each amino acid:

$P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$

Odds ratio

• The match model aligns residues with a joint probability $p_{ab}$

– $P(x,y \mid M) = \prod_i p_{x_i y_i}$

• The ratio of match to random is known as the odds ratio:

$\frac{P(x,y \mid M)}{P(x,y \mid R)} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
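As a minimal sketch of the two models, the snippet below evaluates both probabilities and their ratio for a short ungapped alignment. The two-letter alphabet and all frequencies are hypothetical values chosen for illustration (they are not from the slides), and the function names are made up for this example.

```python
from math import prod

# Hypothetical toy alphabet and frequencies, for illustration only.
q = {"A": 0.6, "B": 0.4}                      # background frequencies q_a
p = {("A", "A"): 0.5, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.3}        # joint pair probabilities p_ab

def prob_random(x, y):
    """P(x, y | R): every residue of both sequences is drawn independently."""
    return prod(q[a] for a in x) * prod(q[b] for b in y)

def prob_match(x, y):
    """P(x, y | M): aligned pairs are drawn jointly (ungapped alignment)."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return prod(p[a, b] for a, b in zip(x, y))

def odds_ratio(x, y):
    """Ratio of the match-model probability to the random-model probability."""
    return prob_match(x, y) / prob_random(x, y)

x, y = "AAB", "ABB"
print(prob_random(x, y), prob_match(x, y), odds_ratio(x, y))
```

A ratio above 1 means the aligned pairs are better explained by the match model than by chance.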

Log odds ratio

• $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$

• $S = \sum_i s(x_i, y_i)$

• The second equation is the sum of the individual scores for each aligned pair of residues. The first equation gives the entries of a score or substitution matrix; for proteins this is a 20 × 20 matrix (e.g. BLOSUM, PAM).
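To make the link between the matrix and the alignment score concrete, here is a small sketch that builds a log-odds matrix and sums it over an alignment. The toy alphabet and frequencies are hypothetical, and the use of base-2 logarithms is an assumption (the slides do not state a base); published matrices are additionally scaled and rounded to integers, which this sketch skips.

```python
from math import log2

# Hypothetical toy alphabet and frequencies, for illustration only.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.5, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.3}

def log_odds(a, b):
    """s(a, b) = log2(p_ab / (q_a * q_b)); base 2 is an assumption."""
    return log2(p[a, b] / (q[a] * q[b]))

# The substitution matrix: one entry per ordered pair of letters.
matrix = {(a, b): log_odds(a, b) for a in q for b in q}

def alignment_score(x, y):
    """S = sum_i s(x_i, y_i) for an ungapped alignment of equal lengths."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(matrix[a, b] for a, b in zip(x, y))

for pair, score in sorted(matrix.items()):
    print(pair, round(score, 3))
print(alignment_score("AAB", "ABB"))
```

Because the logarithm turns the product of per-position odds into a sum, the total S equals the log of the odds ratio for the whole alignment.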

Significance of scores using alignment algorithms

• Calculate a raw score
– Sum of the scores for each letter-to-letter and letter-to-null position

• Calculate a bit score
– Normalizes for the scoring system used

• Calculate an E-value
– Calculated from the bit score to account for the probability that the hit arose by chance

Raw score

• Calculated from substitution matrices (PAM, BLOSUM) and gap costs

• There are substitution matrices for nucleotides also:
– States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.

Bit score

• $S' = (\lambda S - \ln K) / \ln 2$

• $\lambda$ and $K$ are parameters that depend on the scoring system (substitution matrix and gap costs) employed
– Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.
– http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#lambda

• Gap costs – the standard cost associated with a gap of length g
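The bit-score normalization on this slide can be sketched as a one-line function. The function name and the example values of lambda and K below are illustrative assumptions, not values taken from the slides; real values depend on the substitution matrix and gap costs in use.

```python
from math import log

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2, in bits."""
    if lam <= 0 or K <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - log(K)) / log(2)

# Illustrative parameter values only (assumed, not from the source).
lam, K = 0.267, 0.041
for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, lam, K), 1))
```

Dividing by ln 2 expresses the score in bits, so scores obtained with different scoring systems become comparable.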

Gap costs

• Can be linear, like we did in our matrix:
$\gamma(g) = -gd$

• Can be an "affine" score, the most prevalent form now:
$\gamma(g) = -d - (g-1)e$

Here d is called the gap-open penalty and e the gap-extension penalty. The gap-extension penalty e is usually smaller than d, so long insertions and deletions are penalized less than they would be under the linear cost.
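The two gap-cost schemes can be compared directly with a short sketch. The function names and the penalty values d = 11 and e = 1 are assumptions chosen for illustration.

```python
def linear_gap(g, d):
    """Linear cost: every gap position costs d."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -g * d

def affine_gap(g, d, e):
    """Affine cost: d to open the gap, e for each further position."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e

# Hypothetical penalties: gap-open d = 11, gap-extension e = 1.
d, e = 11, 1
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d), affine_gap(g, d, e))
```

Both schemes agree for a gap of length 1; for longer gaps the affine cost grows by e per position instead of d, which is why one long gap is cheaper than several short ones.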

E-value

• $E = N / 2^{S'}$

• This is an approximation for the number (E) of distinct high-scoring segment pairs (HSPs) with normalized score at least S' expected to occur by chance when two random protein sequences of sufficient lengths m and n are compared

• N = mn (search space size)
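A minimal sketch of the E-value formula, assuming the bit score has already been computed; the function name and the example lengths are hypothetical.

```python
def e_value(bit_score, m, n):
    """E = N / 2**S' with search space N = m * n."""
    if m <= 0 or n <= 0:
        raise ValueError("sequence lengths must be positive")
    return (m * n) / 2 ** bit_score

# Two hypothetical proteins of length 250 compared pairwise.
for bits in (10, 20, 30):
    print(bits, e_value(bits, 250, 250))
```

Each additional bit of score halves E, and doubling either sequence length doubles it.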

Database searching

• If a protein is compared to whole database, n is the database length in residues

• The equation can be converted to:
– $S' = \log_2(N/E)$

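• If a protein of length 250 is compared to a protein database of $5 \times 10^6$ residues, then $N = 1.25 \times 10^9$, and a marginally significant E-value of 0.05 requires a normalized score of $\log_2(N/E) \approx 34.5$ bits (a database of $5 \times 10^7$ residues would require about 38 bits)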
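The database-search example can be checked numerically with a short sketch. The function name is hypothetical, and the two database sizes are shown side by side only to illustrate how the threshold moves with the search space.

```python
from math import log2

def required_bit_score(m, n, e):
    """Bit score S' = log2(m * n / E) needed to reach E-value e."""
    if m <= 0 or n <= 0 or e <= 0:
        raise ValueError("m, n and e must be positive")
    return log2(m * n / e)

query_length = 250
for db_residues in (5e6, 5e7):
    bits = required_bit_score(query_length, db_residues, 0.05)
    print(f"{db_residues:.0e} residues: {bits:.1f} bits")
# 5e6 residues gives about 34.5 bits; 5e7 residues gives about 37.9 bits.
```

A tenfold larger database raises the required score by log2(10), roughly 3.3 bits.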

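Significance of the E-value

• The E-value is an expected number of chance hits, not a probability: it is never negative and can exceed 1, although hits of interest have E-values well below 1

• The lower the E-value, the more significant the match

• Note that the E-value depends on the length of the query sequence, since N = mn: at the same bit score, a query of 100 amino acids gives half the E-value of a query of 200 amino acids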
