Calculating substitution matrices
• http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#wm5
• Two models for sequence alignment: one random (R) and one match (M)
• The random model assumes that letter a occurs independently with some frequency $q_a$, so the probability of the two sequences is just the product of the probabilities of each amino acid:
– $P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
Odds ratio
• The match model aligns residues with a joint probability $p_{ab}$
– $P(x,y \mid M) = \prod_i p_{x_i y_i}$
• The ratio of match to random is known as the odds ratio:
– $P(x,y \mid M) / P(x,y \mid R) = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
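The two models and their ratio can be sketched in a few lines of Python. The two-letter alphabet and the numbers in `q` and `p` are invented for illustration (chosen so that `q` is the marginal of `p`); they do not come from any real substitution matrix, and the function names are hypothetical.

```python
import math

# Hypothetical background frequencies q_a for a toy two-letter alphabet.
q = {"A": 0.6, "B": 0.4}
# Hypothetical joint probabilities p_ab of letter a aligned with letter b.
p = {("A", "A"): 0.5, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.3}

def prob_random(x, y):
    """P(x, y | R): product of background frequencies over both sequences."""
    return math.prod(q[a] for a in x) * math.prod(q[b] for b in y)

def prob_match(x, y):
    """P(x, y | M): product of joint probabilities over aligned positions."""
    return math.prod(p[(a, b)] for a, b in zip(x, y))

def odds_ratio(x, y):
    """Ratio of match to random model for an ungapped, equal-length alignment."""
    return prob_match(x, y) / prob_random(x, y)

print(odds_ratio("AAB", "AAB"))  # greater than 1: identities favour the match model
print(odds_ratio("AAB", "BBA"))  # less than 1: mismatches favour the random model
```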
Log odds ratio
• $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$
• $S = \sum_i s(x_i, y_i)$
• The second equation gives the total score as the sum of the individual scores for each aligned pair of residues. The first equation defines the entries of a matrix; for proteins this is a 20 × 20 matrix known as a score or substitution matrix (e.g. BLOSUM, PAM).
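As a minimal sketch of how a log-odds matrix follows from these definitions, the code below builds $s(a,b)$ from a toy set of probabilities and sums it over an alignment. The alphabet, the values of `p` and `q`, the choice of base 2, and the function names are all assumptions made for illustration. Published matrices such as BLOSUM additionally scale and round the scores to integers, which is not done here.

```python
import math

# Toy inputs (hypothetical): background frequencies and joint probabilities.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.5, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.3}

def log_odds_matrix(p, q, base=2.0):
    """s(a, b) = log(p_ab / (q_a * q_b)) for every letter pair."""
    return {(a, b): math.log(p_ab / (q[a] * q[b]), base)
            for (a, b), p_ab in p.items()}

def alignment_score(x, y, s):
    """S = sum over aligned positions of s(x_i, y_i) (ungapped alignment)."""
    return sum(s[(a, b)] for a, b in zip(x, y))

s = log_odds_matrix(p, q)
for pair, value in sorted(s.items()):
    print(pair, round(value, 3))
print("S =", round(alignment_score("AAB", "ABB", s), 3))
```

Because the logarithm turns the product in the odds ratio into a sum, `alignment_score` equals the logarithm of the odds ratio of the same alignment.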
Significance of scores using alignment algorithms
• Calculate a raw score
– Sum of scores for each letter-to-letter and letter-to-null position
• Calculate a bit score
– Normalizes for the scoring system used
• Calculate an E-value
– Calculated from the bit score to account for the probability that the hit arose by chance
Raw score
• Calculated from substitution matrices (PAM, BLOSUM) and gap costs
• There are substitution matrices for nucleotides also:
– States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.
Bit score
• $S' = (\lambda S - \ln K) / \ln 2$
• $\lambda$ and $K$ are parameters dependent upon the scoring system (substitution matrix and gap costs) employed
– Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.
– http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#lambda
• Gap costs – the standard cost associated with a gap of length g
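The bit-score normalization $S' = (\lambda S - \ln K)/\ln 2$ can be written as a one-line function. This is a sketch only: the function name is hypothetical, and the values passed for `lam` and `K` in the example call are placeholders assumed for illustration, since the real values depend on the substitution matrix and gap costs in use.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# Placeholder parameters (assumed); look up lambda and K for the actual scoring system.
print(round(bit_score(100, lam=0.267, K=0.041), 2))
```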
Gap costs
• Can be linear – like we did in our matrix
– $\gamma(g) = -gd$
• Can be an "affine" score – most prevalent now
– $\gamma(g) = -d - (g-1)e$
• Here d is called the gap-open penalty and e is called the gap-extension penalty. The gap-extension penalty e is usually smaller than d, so long insertions and deletions are penalized less than they would be under a linear gap cost.
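The two gap-cost schemes can be compared directly in code. The function names and the penalty values used below (d = 8 for the linear cost, d = 11 and e = 1 for the affine cost) are assumptions chosen only to make the difference visible.

```python
def linear_gap(g, d):
    """Linear gap cost: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap cost: d to open the gap, e for each further position."""
    return -d - (g - 1) * e

# Compare the two schemes for gaps of increasing length (hypothetical penalties).
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=8), affine_gap(g, d=11, e=1))
```

A single-residue gap costs more under this affine setting than under the linear one, but the cost grows much more slowly with gap length. Setting e equal to d makes the affine cost identical to the linear one.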
E - value
• $E = N / 2^{S'}$
• This approximates the number E of distinct HSPs (high-scoring segment pairs) with normalized score at least S' expected to occur by chance when two random protein sequences of sufficient lengths m and n are compared
• $N = mn$ (search space size)
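The relation $E = N/2^{S'}$ with $N = mn$ translates into a short function. The function name and the example lengths and scores are assumptions made for illustration.

```python
def e_value(bit_score, m, n):
    """Expected number of chance HSPs scoring at least bit_score: E = m * n / 2**S'."""
    return (m * n) / 2.0 ** bit_score

# Two sequences of 250 residues each, at a few hypothetical bit scores.
for s_prime in (16, 20, 24):
    print(s_prime, e_value(s_prime, 250, 250))
```

Each additional bit halves the E-value, and doubling either sequence length doubles it.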
Database searching
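• If a protein is compared to a whole database, n is the database length in residues
• The equation can be rearranged to:
– $S' = \log_2(N/E)$
• Example: if a protein of length 250 is compared to a protein database of $5 \times 10^6$ residues, then $N = 1.25 \times 10^9$, and a marginally significant E-value of 0.05 requires a normalized score of $\log_2(2.5 \times 10^{10}) \approx 34.5$ bits. A requirement of 38 bits corresponds to a database ten times larger, $5 \times 10^7$ residues.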
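The rearranged formula $S' = \log_2(N/E)$ makes it easy to check how many bits a given search needs. The sketch below uses a hypothetical function name and evaluates a 250-residue query at E = 0.05 for two database sizes, the second one ten times larger, purely for comparison.

```python
import math

def required_bit_score(e_value, m, n):
    """Bit score S' needed to reach a given E-value: S' = log2(m * n / E)."""
    return math.log2(m * n / e_value)

# Query length 250, target E-value 0.05, two database sizes in residues.
for n in (5e6, 5e7):
    print(f"n = {n:.0e}: {required_bit_score(0.05, 250, n):.1f} bits")
```

Every tenfold increase in database size raises the required score by $\log_2 10 \approx 3.3$ bits.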
Significance of E - value
• The E-value is an expected number of chance hits, not a probability: it is never below 0 but can be larger than 1
• The lower the E-value, the more significant the match
• Note that the E-value depends on the length of the query sequence through N = mn – for the same bit score and database, a query of 100 amino acids gives half the E-value of a query of 200 amino acids