CSE182-L5: Scoring matrices Dictionary Matching
![Page 1: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/1.jpg)
Fa05 CSE 182
CSE182-L5: Scoring matrices
Dictionary Matching
![Page 2: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/2.jpg)
Scoring DNA
• DNA has structure.
![Page 3: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/3.jpg)
DNA scoring matrices
• So far, we considered a simple match/mismatch criterion.
• The nucleotides can be grouped into purines (A, G) and pyrimidines (C, T).
• Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)
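A minimal sketch of a DNA scoring function that treats transitions more leniently than transversions. The function name `dna_score` and the score values (+1, -1, -2) are assumptions chosen for illustration; the slides do not give numbers.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(a, b, match=1, transition=-1, transversion=-2):
    """Score a pair of nucleotides; the numeric values are illustrative only."""
    if a == b:
        return match
    # A transition stays within one group (purine<->purine or pyrimidine<->pyrimidine).
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion

print(dna_score("A", "G"))  # transition
print(dna_score("A", "C"))  # transversion
```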
![Page 4: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/4.jpg)
Scoring proteins
• Scoring protein sequence alignments is a much more complex task than scoring DNA
– Not all substitutions are equal
• Problem was first worked on by Pauling and collaborators
• In the 1970s, Margaret Dayhoff created the first similarity matrices.
– “One size does not fit all”
– Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant
– Different proteins might evolve at different rates and we need to normalize for that
![Page 5: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/5.jpg)
PAM 1 distance
• Two sequences are 1 PAM apart if they differ in 1% of the residues.
• PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]
1% mismatch
![Page 6: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/6.jpg)
PAM1 matrix
• Align many proteins that are very similar
– Is this a problem?
• The PAM1 matrix gives the probability of each substitution when 1% of the residues have changed
• Estimate the frequency $P_{b|a}$ of residue a being substituted by residue b.
• $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$
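The two forms of the log-odds score agree because $P_{ab} = P_a P_{b|a}$. The sketch below checks this numerically; the function names and all frequencies are hypothetical values chosen for illustration, not estimates from real alignments.

```python
import math

def log_odds(p_ab, p_a, p_b):
    """S(a,b) = log10(P_ab / (P_a * P_b))."""
    return math.log10(p_ab / (p_a * p_b))

def log_odds_conditional(p_b_given_a, p_b):
    """Equivalent form: S(a,b) = log10(P_{b|a} / P_b)."""
    return math.log10(p_b_given_a / p_b)

# Hypothetical frequencies (assumptions for the example).
p_a, p_b = 0.08, 0.05
p_b_given_a = 0.10
p_ab = p_a * p_b_given_a

score = log_odds(p_ab, p_a, p_b)
print(score, log_odds_conditional(p_b_given_a, p_b))
```

A positive score means the pair (a, b) is seen aligned more often than chance would predict; a negative score means less often.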
![Page 7: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/7.jpg)
PAM 1
![Page 8: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/8.jpg)
PAM distance
• Two sequences are 1 PAM apart when they differ in 1% of the residues.
• When are 2 sequences 2 PAMs apart?
[Figure: two successive steps of 1 PAM each, giving sequences 2 PAMs apart]
![Page 9: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/9.jpg)
Higher PAMs
• $\mathrm{PAM2}(a,b) = \sum_c \mathrm{PAM1}(a,c)\,\mathrm{PAM1}(c,b)$
• PAM2 = PAM1 × PAM1 (matrix multiplication)
• PAM250
– $= \mathrm{PAM1} \times \mathrm{PAM249}$
– $= (\mathrm{PAM1})^{250}$
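Higher PAM matrices are just matrix powers of PAM1. A minimal pure-Python sketch on a toy two-letter alphabet follows; the 2×2 `pam1` values are an assumption for illustration (a real PAM1 matrix is 20×20), and `mat_mul` / `mat_pow` are hypothetical helper names.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Compute m**power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Toy PAM1: each residue changes to the other with probability 0.01.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam2 = mat_pow(pam1, 2)      # PAM2(a,b) = sum_c PAM1(a,c) * PAM1(c,b)
pam250 = mat_pow(pam1, 250)
print(pam2[0][1], pam250[0][1])
```

Each row still sums to 1 after exponentiation, and the off-diagonal entries grow with the PAM distance, which is why substitutions score better under PAM250 than under PAM1.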
![Page 10: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/10.jpg)
PAM250 based scoring matrix
• $S_{250}(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$
![Page 11: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/11.jpg)
Scoring using PAM matrices
• Suppose we know that two sequences are 250 PAMs apart.
• $S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b} = \log_{10}\frac{\mathrm{PAM250}(a,b)}{P_b}$
• How does it help?
– S250(A,V) >> S1(A,V)
– Scoring of hum vs. Dros should use a higher PAM matrix than scoring hum vs. mus.
– An alignment with a smaller % identity could still have a higher score and be more significant
[Figure: hum, mus, dros]
![Page 12: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/12.jpg)
BLOSUM series of Matrices
• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions
• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
• BLOSUM60: merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
– In practice BLOSUM62 seems to work very well.
![Page 13: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/13.jpg)
PAM vs. BLOSUM
• What is the correspondence?
• PAM1 Blosum1
• PAM2 Blosum2
• Blosum62
• PAM250 Blosum100
![Page 14: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/14.jpg)
The last step in Blast
• We have discussed
– Alignments
– Db filtering using keywords
– E-values and P-values
– Scoring matrices
• The last step: Database filtering requires us to scan a large sequence fast for matching keywords
![Page 15: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/15.jpg)
Dictionary Matching, R.E. matching, and position specific scoring
![Page 16: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/16.jpg)
Keyword search
• Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword.
• Question: Given a collection of strings (keywords), find all occurrences in a database string where a keyword matches.
![Page 17: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/17.jpg)
Dictionary Matching
• Q: Given k words ($s_i$ has length $l_i$), and a database of size n, find all matches to these words in the database string.
• How fast can this be done?
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
Database: P O T A S T P O T A T O
![Page 18: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/18.jpg)
Dict. Matching & string matching
• How fast can you do it, if you only had one word of length m?
– Trivial algorithm: O(nm) time
– Pre-processing O(m), search O(n) time.
• Dictionary matching
– Trivial algorithm: $O((l_1 + l_2 + l_3 + \dots)\,n)$
– Using a keyword tree: $O(l_p n)$ ($l_p$ is the length of the longest pattern)
– Aho-Corasick: O(n) after preprocessing $O(l_1 + l_2 + \dots)$
• We will consider the most general case
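For reference, the trivial algorithm can be sketched as follows: try every word at every database position. The function name is hypothetical; positions are 0-based and dictionary indices are 1-based, as on the slides.

```python
def naive_dictionary_match(database, words):
    """Return (start, word_index) for every exact match; O((l1 + l2 + ...) * n) time."""
    hits = []
    for start in range(len(database)):
        for index, word in enumerate(words, start=1):
            # str.startswith with an offset compares word against database[start:].
            if database.startswith(word, start):
                hits.append((start, index))
    return hits

dictionary = ["POTATO", "POTASSIUM", "TASTE"]
print(naive_dictionary_match("POTASTPOTATO", dictionary))
```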
![Page 19: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/19.jpg)
Direct Algorithm
[Figure: the pattern P O T A T O slid along the database P O T A S T P O T A T O, one start position at a time]
Observations:
• When we mismatch, we (should) know something about where the next match will be.
• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.
![Page 20: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/20.jpg)
The Trie Automaton
[Figure: keyword tree for the dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE, with root r, example nodes u, v, w, and terminal nodes labeled 1, 2, 3]
• Construct an automaton A from the dictionary
– A[v,x] describes the transition from node v to a node w upon reading x.
– A[u,’T’] = v, and A[u,’S’] = w
– Special root node r
– Some nodes are terminal, and labeled with the index of the dictionary word.
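One possible way to build the trie automaton is sketched below. Storing A as a Python dict keyed by (node, symbol) and numbering nodes in creation order are assumptions of this sketch, not something the slides prescribe.

```python
def build_trie(words):
    """Build the keyword tree: A[(v, x)] = w, terminal[v] = index of the word ending at v."""
    A = {}
    terminal = {}
    root = 0
    node_count = 1  # the root already exists
    for index, word in enumerate(words, start=1):
        v = root
        for x in word:
            if (v, x) not in A:        # create a new node only when the path diverges
                A[(v, x)] = node_count
                node_count += 1
            v = A[(v, x)]
        terminal[v] = index
    return A, terminal, node_count

A, terminal, node_count = build_trie(["POTATO", "POTASSIUM", "TASTE"])
print(node_count, terminal)
```

POTATO and POTASSIUM share the path P-O-T-A, so the tree has fewer nodes than the total number of letters in the dictionary.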
![Page 21: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/21.jpg)
An $O(l_p n)$ algorithm for keyword matching
• Start with the first position in the db, and the root node.
• If successful transition
– Increment ‘current’ pointer
– Move to a new node
– If terminal node, “success”
• Else
– Retract ‘current’ pointer
– Increment ‘start’ pointer
– Move to root & repeat
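The steps above can be sketched as follows, with `l` as the ‘start’ pointer and `c` as the ‘current’ pointer. The trie representation and the function names are assumptions of this sketch.

```python
def build_trie(words):
    """A[(v, x)] = child of node v on symbol x; terminal[v] = dictionary index."""
    A, terminal, count = {}, {}, 1
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if (v, x) not in A:
                A[(v, x)] = count
                count += 1
            v = A[(v, x)]
        terminal[v] = index
    return A, terminal

def trie_scan(database, words):
    """Restart from the root at every start position: O(l_p * n) time."""
    A, terminal = build_trie(words)
    hits = []
    for l in range(len(database)):       # increment the 'start' pointer
        v, c = 0, l                      # move to root, retract 'current' to 'start'
        while c < len(database) and (v, database[c]) in A:
            v = A[(v, database[c])]      # successful transition
            c += 1
            if v in terminal:            # terminal node: "success"
                hits.append((l, terminal[v]))
    return hits

print(trie_scan("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
```

Each start position costs at most $l_p$ transitions, since no path in the tree is longer than the longest pattern.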
![Page 22: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/22.jpg)
Illustration:
[Figure: the keyword tree traversed against the database P O T A S T P O T A T O, with start pointer l, current pointer c, and current node v]
![Page 23: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/23.jpg)
Idea for improving the time
Database: P O T A S T P O T A T O (pointers l and c mark the start and current positions)
• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
– Then prefix(pattern j) = suffix(first c-l characters of pattern i)
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
Pattern i: P O T A S S I U M
Pattern j: T A S T E
![Page 24: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/24.jpg)
Improving speed of dictionary matching
• Every node v corresponds to a string $s_v$ that is a prefix of some pattern.
• Define F[v] to be the node u such that $s_u$ is the longest proper suffix of $s_v$ that is itself a node of the tree (a prefix of some pattern)
• If we fail to match at v, we should jump to F[v], and commence matching from there
• Let lp[v] = $|s_u|$
[Figure: the keyword tree with nodes numbered 1 to 11]
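To make the definition of F[v] concrete, the sketch below computes it directly from the definition, identifying each node with its string $s_v$. This brute-force version is for illustration only; it is not the efficient breadth-first construction used in practice, and the function name `failure` is an assumption.

```python
patterns = ["POTATO", "POTASSIUM", "TASTE"]

# Every node of the keyword tree corresponds to a prefix of some pattern
# (the empty string is the root).
prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}

def failure(s_v):
    """Return s_u for u = F[v]: the longest proper suffix of s_v that is a tree node."""
    for k in range(1, len(s_v) + 1):   # drop k leading characters, shortest drop first
        if s_v[k:] in prefixes:
            return s_v[k:]
    return ""                          # the root fails to itself

for s_v in ("POTAS", "POTAT", "TAST", "P"):
    s_u = failure(s_v)
    print(s_v, "->", repr(s_u), "lp =", len(s_u))
```

For example, after matching POTAS the longest reusable suffix is TAS, a prefix of TASTE, so matching can continue from that node without rereading the database.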
![Page 25: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/25.jpg)
An O(n) alg. for keyword matching
• Start with the first position in the db, and the root node.
• If successful transition
– Increment ‘current’ pointer
– Move to a new node
– If terminal node, “success”
• Else (if at root)
– Increment ‘current’ pointer
– Move ‘start’ pointer
– Move to root
• Else
– Move ‘start’ pointer forward
– Move to failure node
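A compact sketch of the failure-link search, in the style of Aho-Corasick, is given below. It tracks only the current node and the ‘current’ pointer c; the ‘start’ pointer is implicit, since l = c minus the depth of the current node. Merging output lists along failure links (so that a word ending inside another word is still reported) is a standard addition not spelled out on the slides, and all names here are assumptions.

```python
from collections import deque

def build_automaton(patterns):
    goto = [{}]   # goto[v][x] = child of node v on symbol x; node 0 is the root
    out = [[]]    # out[v] = dictionary indices of words ending at node v
    for idx, pat in enumerate(patterns, start=1):
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(idx)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:                             # breadth-first: shallower nodes first
        v = queue.popleft()
        for ch, w in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            out[w] = out[w] + out[fail[w]]   # inherit words ending at the failure node
            queue.append(w)
    return goto, fail, out

def search(database, patterns):
    goto, fail, out = build_automaton(patterns)
    hits, v = [], 0
    for c, ch in enumerate(database):
        while v and ch not in goto[v]:       # failed transition: jump to failure node
            v = fail[v]
        v = goto[v].get(ch, 0)               # at the root, just advance 'current'
        for idx in out[v]:
            hits.append((c - len(patterns[idx - 1]) + 1, idx))
    return hits

print(search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
```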
![Page 26: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/26.jpg)
Illustration
[Figure: the keyword tree with failure links traversed against the database P O T A S T P O T A T O, with pointers l and c and current node v]
![Page 27: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/27.jpg)
Time analysis
• In each step, either c is incremented, or l is incremented
• Neither pointer is ever decremented (lp[v] < c-l).
• l and c do not exceed n
• Total time <= 2n
[Figure: database P O T A S T P O T A T O with pointers l and c]
![Page 28: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/28.jpg)
Blast: Putting it all together
• Input: Query of length m, database of size n
• Select word-size, scoring matrix, gap penalties, E-value cutoff
![Page 29: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/29.jpg)
Blast Steps
1. Generate an automaton of all query keywords.
2. Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.
![Page 30: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/30.jpg)
Protein Sequence Analysis
• What can you do if BLAST does not return a hit?
– Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
• A: Accept hits at higher P-value.
– This increases the probability that the sequence similarity is a chance event.
– How can we get around this paradox?
– Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?
![Page 31: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/31.jpg)
Silly Quiz
![Page 32: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/32.jpg)
Silly Quiz
![Page 33: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/33.jpg)
Protein sequence motifs
• Premise:
– The sequence of a protein gives clues about its structure and function.
– Not all residues are equally important in determining function.
– Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
• How can we identify these key residues?
![Page 34: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/34.jpg)
Prosite
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors ( 3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann ,Philipp Bucher, Laurent Falquet and Amos Bairoch
The PROSITE database, its status in 1999
![Page 35: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/35.jpg)
Basic idea
• It is a heuristic approach. Start with the following:
– A collection of sequences with the same function.
– Region/residues known to be significant for maintaining structure and function.
• Develop a pattern of conserved residues around the residues of interest
• Iterate for appropriate sensitivity and specificity
![Page 36: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/36.jpg)
EX: Zinc Finger domain
![Page 37: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/37.jpg)
Proteins containing zf domains
How can we find a motif corresponding to a zf domain?
![Page 38: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/38.jpg)
From alignment to regular expressions
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
[* marks the residue of interest in the alignment]
Pattern: ATH-[DE]
• Search Swissprot with the resulting pattern
• Refine pattern to eliminate false positives
• Iterate
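As a rough sketch of the first step, a pattern can be read off an ungapped alignment column by column: a fully conserved column becomes a residue, a column with few alternatives becomes a bracketed class, and anything else becomes the wildcard x. The function name and the `max_alternatives` cutoff are assumptions; real PROSITE patterns are curated by hand. The output separates every position with a dash, so the slide's ATH-[DE] appears as A-T-H-[DE].

```python
def pattern_from_alignment(seqs, max_alternatives=2):
    """Derive a PROSITE-like pattern from equal-length aligned sequences."""
    elements = []
    for column in zip(*seqs):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                    # conserved residue
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # small residue class
        else:
            elements.append("x")                            # any residue
    # Trim uninformative wildcards at both ends.
    while elements and elements[0] == "x":
        elements.pop(0)
    while elements and elements[-1] == "x":
        elements.pop()
    return "-".join(elements)

alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
print(pattern_from_alignment(alignment))
```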
![Page 39: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/39.jpg)
The sequence analysis perspective
• Zinc Finger motif
– C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
– 2 conserved C, and 2 conserved H
• How can we search a database using these motifs?
– The motif is described using a regular expression. What is a regular expression?
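A PROSITE-style pattern maps almost directly onto the syntax of Python's `re` module. The converter below is a sketch that handles only the constructs visible in the zinc finger pattern plus `{...}` exclusions; the function name is hypothetical and anchors and other PROSITE features are ignored.

```python
import re

def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-like pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core "x" and repeat bounds.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # excluded residues
        # Single residues and [..] classes are already valid regex syntax.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

zf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zf)

# A made-up sequence that fits the pattern (not a real protein).
candidate = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
print(re.search(zf, candidate) is not None)
```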
![Page 40: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/40.jpg)
Regular Expressions
• Concise representation of a set of strings over alphabet $\Sigma$.
• Described by a string over $\{\Sigma, \cdot, *, +\}$
• R is a r.e. if and only if
– $R = \{\varepsilon\}$ (base case)
– $R = \{\sigma\}$, $\sigma \in \Sigma$
– $R = R_1 + R_2$ (union of strings)
– $R = R_1 \cdot R_2$ (concatenation)
– $R = R_1^*$ (0 or more repetitions)
![Page 41: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/41.jpg)
Regular Expression
• Q: Let $\Sigma$ = {A,C,E}
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
• Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
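The membership questions can be checked with Python's `re` module. Note the notation change: the slides write union as `+`, while Python writes it as `|` (in Python, `+` means "one or more"). `fullmatch` is used because the whole string, not just a substring, must be in R. The helper name `in_language` is an assumption.

```python
import re

# (A+C)*EEC* in the slide notation becomes (A|C)*EEC* in Python.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the entire string s belongs to the set described by R."""
    return R.fullmatch(s) is not None

for s in ("CEEC", "AEC", "ACEE"):
    print(s, in_language(s))
```

AEC is rejected because R requires two consecutive E symbols.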
![Page 42: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/42.jpg)
Regular Expression & Automata
Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
– The automaton has a start and end node
– Each edge is labeled with a symbol from $\Sigma$, or $\varepsilon$
Suppose R is described by automaton A. Then $s \in R$ if and only if there is a path from start to end in A, labeled with s.
![Page 43: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/43.jpg)
Examples: Regular Expression & Automata
• (A+C)*EEC*
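[Figure: automaton from start to end, with loops labeled A and C near the start, two consecutive edges labeled E, and a loop labeled C near the end]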
![Page 44: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/44.jpg)
Constructing automata from R.E
• $R = \{\varepsilon\}$
• $R = \{\sigma\}$, $\sigma \in \Sigma$
• $R = R_1 + R_2$
• $R = R_1 \cdot R_2$
• $R = R_1^*$
![Page 45: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/45.jpg)
Regular Expression Matching
• Given a database D, and a regular expression R, is a substring of D in R?
• Is there a string D[l..c] that is accepted by the automaton of R?
• Simpler Q: Is D[1..c] accepted by the automaton of R?
![Page 46: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/46.jpg)
![Page 47: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/47.jpg)
Alg. For matching R.E.
• If D[1..c] is accepted by the automaton $R_A$
– There is a path labeled D[1]…D[c] that goes from START to END in $R_A$
– There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END
[Figure: path labeled D[1] .. D[c-1] from START to u, followed by an edge labeled D[c] from u to END]
![Page 48: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/48.jpg)
D.P. to match regular expression
• Define:
– A[u,σ] = automaton node reached from u after reading σ
– Eps(u): set of all nodes reachable from node u using epsilon transitions.
– N[c] = subset of nodes reachable from START node after reading D[1..c]
– Q: when is $v \in N[c]$?
[Figure: an edge labeled σ from u to v; epsilon edges from u to Eps(u)]
![Page 49: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/49.jpg)
D.P. to match regular expression
• Q: when is $v \in N[c]$?
• A: If for some $u \in N[c-1]$, $w = A[u, D[c]]$, and $v \in \{w\} \cup \mathrm{Eps}(w)$
![Page 50: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/50.jpg)
Algorithm
![Page 51: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/51.jpg)
The final step
• We have answered the question:
– Is D[1..c] accepted by R?
– Yes, if $\mathrm{END} \in N[c]$
• We need to answer
– Is D[l..c] (for some l, and some c) accepted by R?
$D[l..c] \in R \Leftrightarrow D[1..c] \in \Sigma^* R$
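The recurrence for N[c] and the $\Sigma^* R$ trick can be sketched together as follows. The automaton for (A+C)*EEC* is hand-built here, with node numbering and one epsilon edge chosen as assumptions for illustration. Setting `anywhere=True` re-adds START after every symbol, which has the same effect as prefixing R with $\Sigma^*$.

```python
def eps_closure(nodes, eps):
    """Return nodes plus everything reachable from them via epsilon edges."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def match_ends(D, A, eps, start, end, anywhere=False):
    """Return the 1-based positions c with END in N[c]."""
    N = eps_closure({start}, eps)                  # N[0]
    hits = []
    for c, symbol in enumerate(D, start=1):
        step = {A[(u, symbol)] for u in N if (u, symbol) in A}
        N = eps_closure(step, eps)                 # N[c] from N[c-1]
        if end in N:
            hits.append(c)
        if anywhere:                               # allow a match to begin at c+1
            N |= eps_closure({start}, eps)
    return hits

# Hand-built automaton for (A+C)*EEC*: node 0 = START, node 3 = END.
A = {(0, "A"): 0, (0, "C"): 0, (0, "E"): 1, (1, "E"): 2, (3, "C"): 3}
eps = {2: [3]}

print(match_ends("CEEC", A, eps, 0, 3))
print(match_ends("EACEE", A, eps, 0, 3, anywhere=True))
```

Each position does work proportional to the size of the automaton, so the scan is linear in the database length for a fixed regular expression.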
![Page 52: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/52.jpg)
Profiles versus regular expressions
• Regular expressions are intolerant to an occasional mis-match.
• The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members.
• Profiles capture some of these ideas.
![Page 53: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/53.jpg)
Profiles
• Start with an alignment of strings of length m, over an alphabet A,
• Build an |A| × m matrix $F = (f_{ki})$
• Each entry $f_{ki}$ represents the frequency of symbol k in position i
[Figure: example alignment and profile; visible entries include 0.71, 0.14, 0.14, 0.28]
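Building the profile from an ungapped alignment amounts to counting symbols per column. A minimal sketch, assuming equal-length sequences and a dict-of-lists layout for F (both are choices of this example):

```python
from collections import Counter

def build_profile(alignment, alphabet):
    """Return F with F[k][i] = frequency of symbol k in column i."""
    m = len(alignment[0])
    n = len(alignment)
    profile = {k: [0.0] * m for k in alphabet}
    for i in range(m):
        counts = Counter(seq[i] for seq in alignment)
        for k, count in counts.items():
            profile[k][i] = count / n
    return profile

# Toy alignment (the conserved core of the ATH-[DE] example).
alignment = ["ATHD", "ATHD", "ATHE"]
F = build_profile(alignment, alphabet="ADEHT")
print(F["D"], F["E"])
```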
![Page 54: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/54.jpg)
Scoring Profiles
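$S(i,j) = \sum_k f_{ki}\, M[r_k, s_j]$
where $f_{ki}$ is the frequency of residue $r_k$ in profile column i, $s_j$ is the j-th residue of the sequence s, and M is the scoring matrix.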
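The profile score is a frequency-weighted average of ordinary substitution scores. The sketch below evaluates $S(i,j)$ for one column; the toy scoring matrix (+1 match, -1 mismatch) and the profile values are assumptions standing in for a real matrix such as BLOSUM62.

```python
def column_score(profile, i, residue, M):
    """S(i, j) = sum over k of f_ki * M[r_k, s_j], where residue plays the role of s_j."""
    return sum(freqs[i] * M[(k, residue)] for k, freqs in profile.items())

alphabet = "DE"
# Toy scoring matrix: +1 for identical residues, -1 otherwise.
M = {(a, b): (1 if a == b else -1) for a in alphabet for b in alphabet}

# One-column profile: D in 2/3 of the family members, E in 1/3.
profile = {"D": [2 / 3], "E": [1 / 3]}

print(column_score(profile, 0, "D", M))
print(column_score(profile, 0, "E", M))
```

Unlike the union [DE] in a regular expression, the profile scores D higher than E in this column because D is more frequent in the family.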
![Page 55: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/55.jpg)
Psi-BLAST idea
• Multiple alignments are important for capturing remote homology.
• Profile based scores are a natural way to handle this.
• Q: What if the query is a single sequence?
• A: Iterate:
– Find homologs using Blast on query
– Discard very similar homologs
– Align, make a profile, search with profile.
![Page 56: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/56.jpg)
Psi-BLAST speed
• Two time consuming steps.
1. Multiple alignment of homologs
2. Searching with profiles.
– Does the keyword search idea work?
• Multiple alignment:
– Use ungapped multiple alignments only
• Pigeonhole principle again:
– If profile of length m must score >= T
– Then, a sub-profile of length l must score >= lT/m
– Generate all l-mers that score at least lT/m
– Search using an automaton
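The keyword-generation step can be sketched by brute force: enumerate every l-mer and keep those whose score against a length-l window of the profile reaches lT/m. The per-column score table, the toy two-letter alphabet, and the function name are all assumptions; a practical implementation would prune the enumeration instead of trying all $|A|^l$ words.

```python
from itertools import product

def high_scoring_lmers(column_scores, start, l, T):
    """Return all l-mers scoring >= l*T/m against columns start..start+l-1."""
    m = len(column_scores)
    threshold = l * T / m
    window = column_scores[start:start + l]
    alphabet = sorted(window[0])
    hits = []
    for lmer in product(alphabet, repeat=l):
        score = sum(col[x] for col, x in zip(window, lmer))
        if score >= threshold:
            hits.append("".join(lmer))
    return hits

# column_scores[i][x] = profile score of residue x at column i (made-up values).
column_scores = [
    {"A": 2, "C": -1},
    {"A": -1, "C": 2},
    {"A": 2, "C": -1},
    {"A": 1, "C": 1},
]
print(high_scoring_lmers(column_scores, start=0, l=2, T=6))
```

The resulting l-mers form the dictionary that is then loaded into the keyword automaton.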
![Page 57: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/57.jpg)
Databases of Motifs
• Functionally related proteins have sequence motifs.
• The sequence motifs can be represented in many ways, and different biological databases capture these representations
– Collection of sequences (SMART)
– Multiple alignments (BLOCKS)
– Profiles (Pfam (HMMs) / Impala)
– Regular Expressions (Prosite)
• Different representations must be queried in different ways
![Page 58: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/58.jpg)
Databases of protein domains
![Page 60: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/60.jpg)
PROSITE
http://us.expasy.org/prosite/
![Page 61: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/61.jpg)
![Page 62: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/62.jpg)
BLOCKS
![Page 63: CSE182-L5: Scoring matrices Dictionary Matching](https://reader035.fdocuments.us/reader035/viewer/2022062423/56814e8d550346895dbc3226/html5/thumbnails/63.jpg)