Finding Patterns
Gopalan Vivek
Lee Teck Kwong Bernett
Recap
Multiple Sequence Alignment
           ....|....| ....|....| ....|....| ....|....| ....|....|
                  665        675        685        695        705
Sp1        ACTCPYCKDS EGRGSG---- DPGKKKQHIC HIQGCGKVYG KTSHLRAHLR
Sp2        ACTCPNCKDG EKRS------ GEQGKKKHVC HIPDCGKTFR KTSLLRAHVR
Sp3        ACTCPNCKEG GGRGTN---- -LGKKKQHIC HIPGCGKVYG KTSHLRAHLR
Sp4        ACSCPNCREG EGRGSN---- EPGKKKQHIC HIEGCGKVYG KTSHLRAHLR
DrosBtd    RCTCPNCTNE MSGLPPIVGP DERGRKQHIC HIPGCERLYG KASHLKTHLR
DrosSp     TCDCPNCQEA ERLGPAGV-- HLRKKNIHSC HIPGCGKVYG KTSHLKAHLR
CeT22C8.5  RCTCPNCKAI KHG------- DRGSQHTHLC SVPGCGKTYK KTSHLRAHLR
Y40B1A.4   PQISLKKKIF FFIFSNFR-- GDGKSRIHIC HL--CNKTYG KTSHLRAHLR
Introduction
The terminology used in pattern finding is quite loose.
Terms may be used differently by different authors.
Thus there is a need to know the context in which the terms are used.
           ....|....| ....|....| ....|....| ....|....| ....|....| ....|....|
                  665        675        685        695        705        715
Sp1        ACTCPYCKDS EGRGSG---- DPGKKKQHIC HIQGCGKVYG KTSHLRAHLR WHTGERPFMC
Sp2        ACTCPNCKDG EKRS------ GEQGKKKHVC HIPDCGKTFR KTSLLRAHVR LHTGERPFVC
Sp3        ACTCPNCKEG GGRGTN---- -LGKKKQHIC HIPGCGKVYG KTSHLRAHLR WHSGERPFVC
Sp4        ACSCPNCREG EGRGSN---- EPGKKKQHIC HIEGCGKVYG KTSHLRAHLR WHTGERPFIC
DrosBtd    RCTCPNCTNE MSGLPPIVGP DERGRKQHIC HIPGCERLYG KASHLKTHLR WHTGERPFLC
DrosSp     TCDCPNCQEA ERLGPAGV-- HLRKKNIHSC HIPGCGKVYG KTSHLKAHLR WHTGERPFVC
CeT22C8.5  RCTCPNCKAI KHG------- DRGSQHTHLC SVPGCGKTYK KTSHLRAHLR KHTGDRPFVC
Y40B1A.4   PQISLKKKIF FFIFSNFR-- GDGKSRIHIC HL--CNKTYG KTSHLRAHLR GHAGNKPFAC
C2H2 Zinc finger motif
Prosite pattern
C-x(2,4)-C-x(12)-H-x(3)-H
Motif
– Common sequence elements shared by a group of sequences.
– Indicative of a functional or evolutionary relationship.
– Example: N-Glycosylation site, N-{P}-[ST]-{P}
Pattern
– “A consistent, characteristic form, style, or method, as a composite of traits or features characteristic of an individual or a group.” (dictionary.com)
– A physical expression of a motif.
– Many forms of expression.
Signature/Print
– A set of patterns that defines a group of sequences having a certain common characteristic.
– Bacterial Rhodopsin (2 patterns)
• R-Y-x-[DT]-W-x-[LIVMF]-[ST]-T-P-[LIVM](3)
• [FYIV]-x-[FYVG]-[LIVM]-D-[LIVMF]-x-[STA]-K-x(2)-[FY]
A single point is not indicative of identity.
But many points allow for identification.
Why pattern finding and not sequence comparison?
Useful in the event of low sequence similarity to infer function or family
– Certain motifs are characteristic of function or family.
– Zinc finger motif, indicative of DNA binding.
– Avidin motif, indicative of the Avidin family of proteins.
Detection of specific motifs or signals
– Example:
• Restriction Endonuclease sites
– EcoRI
» 5’-G^AATT C-3’ (Sense strand)
» 3’-C TTAA^G-5’ (Antisense strand)
• Transcription factor binding sites
– GAL4
» CCCCAGaTTTTC
• Protein motifs
– Zinc finger
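As a rough illustration of detecting a specific signal, the sketch below scans a DNA string for the EcoRI recognition site GAATTC. The function name `find_ecori_sites` and the example sequence are made up for illustration. Because the site is palindromic, scanning one strand is enough to find sites on both.

```python
import re

ECORI_SITE = "GAATTC"  # EcoRI recognition sequence; the cut falls after the first G

def find_ecori_sites(dna):
    """Return the 0-based start position of every EcoRI site in dna."""
    dna = dna.upper()
    # A lookahead is used so that overlapping occurrences would also be reported.
    return [m.start() for m in re.finditer("(?=" + ECORI_SITE + ")", dna)]

example = "ttgaattcagctgaattcaa"  # hypothetical sequence, not from the slides
for start in find_ecori_sites(example):
    # The sense strand is cut between the G at `start` and the A at `start + 1`.
    print("EcoRI site starts at", start, "cut after position", start)
```

The same approach should work for any fixed recognition sequence by swapping the constant.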
Usually faster than sequence comparison
– Blast has to search using many fragments.
– Pattern searching just searches once.
Types of Patterns
DNA
– Restriction Endonuclease sites
– DNA binding motifs
– Transcription Factor binding sites
– Splicing site motifs
– Other signals
Protein
– Sequence motifs
• Zinc finger
• SH2 domains
– Structural patterns
Representations
Regular Expression (RE)
Prosite Patterns
Profiles (PSSM)
Hidden Markov Models (HMM)
Sp1        CHIQGCGKVYGKTSHLRAHLRWH
Sp2        CHIPDCGKTFRKTSLLRAHVRLH
Sp3        CHIPGCGKVYGKTSHLRAHLRWH
Sp4        CHIEGCGKVYGKTSHLRAHLRWH
DrosBtd    CHIPGCERLYGKASHLKTHLRWH
DrosSp     CHIPGCGKVYGKTSHLKAHLRWH
CeT22C8.5  CSVPGCGKTYKKTSHLRAHLRKH
Y40B1A.4   CHL--CNKTYGKTSHLRAHLRGH
Sequences containing zinc finger motif
Regular Expression
Used in computer science
Syntax:
Character  Meaning
^          Match the beginning of the line
$          Match the end of the line
*          Match 0 or more repetitions of the preceding character
+          Match 1 or more repetitions of the preceding character
?          Match 0 or 1 occurrence of the preceding character
{m}        Match exactly m repetitions of the preceding character
{m,n}      Match from m to n repetitions of the preceding character
Char       Match that character
.          Match any character
[]         Match any one of the characters within the brackets
[^Char]    Match any character except Char
Zinc finger motif
C.{2,4}C.{12}H.{3}H
Example
Sp1        CHIQGCGKVYGKTSHLRAHLRWH
Sp2        CHIPDCGKTFRKTSLLRAHVRLH
Sp3        CHIPGCGKVYGKTSHLRAHLRWH
Sp4        CHIEGCGKVYGKTSHLRAHLRWH
DrosBtd    CHIPGCERLYGKASHLKTHLRWH
DrosSp     CHIPGCGKVYGKTSHLKAHLRWH
CeT22C8.5  CSVPGCGKTYKKTSHLRAHLRKH
Y40B1A.4   CHL--CNKTYGKTSHLRAHLRGH
C.{2,4}C.{12}H.{3}H
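A minimal sketch of applying this regular expression, using Python's standard `re` module. The sequences are the ones shown on the slide; the helper name `has_zinc_finger` and the choice to strip alignment gaps before matching are assumptions made for this illustration.

```python
import re

# Zinc finger pattern from the slide, written as a regular expression.
ZINC_FINGER = re.compile(r"C.{2,4}C.{12}H.{3}H")

SEQUENCES = {
    "Sp1": "CHIQGCGKVYGKTSHLRAHLRWH",
    "Sp2": "CHIPDCGKTFRKTSLLRAHVRLH",
    "Sp3": "CHIPGCGKVYGKTSHLRAHLRWH",
    "Sp4": "CHIEGCGKVYGKTSHLRAHLRWH",
    "DrosBtd": "CHIPGCERLYGKASHLKTHLRWH",
    "DrosSp": "CHIPGCGKVYGKTSHLKAHLRWH",
    "CeT22C8.5": "CSVPGCGKTYKKTSHLRAHLRKH",
    "Y40B1A.4": "CHL--CNKTYGKTSHLRAHLRGH",
}

def has_zinc_finger(seq):
    """Return True if the ungapped sequence contains the pattern anywhere."""
    ungapped = seq.replace("-", "")  # alignment gaps are not real residues
    return ZINC_FINGER.search(ungapped) is not None

for name, seq in SEQUENCES.items():
    print(name, has_zinc_finger(seq))
```

Y40B1A.4 still matches after its gaps are removed because x(2,4) tolerates the shorter spacing between the two cysteines.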
Prosite Patterns
Very similar to RE
Patterns encoded in Prosite style or RE style can be converted easily from one style to the other
More familiar to biologists
RE Prosite
^ <
$ >
? (0,1)
{m} (m)
{m,n} (m,n)
Char Char
. x
[] []
[^char] {}
Zinc finger motif
RE: C.{2,4}C.{12}H.{3}H
Prosite: C-x(2,4)-C-x(12)-H-x(3)-H
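The correspondence table can be turned into a small converter. This is only a sketch covering the constructs listed in the table; the function name `prosite_to_regex` is hypothetical, and other features of the full Prosite syntax are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a Prosite-style pattern to a regular expression string."""
    pieces = []
    # Prosite elements are separated by '-'; a trailing '.' may end the pattern.
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # '<' anchors at the start
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # '>' anchors at the end
            suffix, element = "$", element[:-1]
        # Split an optional repetition such as (3) or (2,4) from the core.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":                      # x means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {..} means none of these
        # Single letters and [..] classes are the same in both styles.
        repeat = "{" + count + "}" if count else ""
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)

print(prosite_to_regex("C-x(2,4)-C-x(12)-H-x(3)-H"))
print(prosite_to_regex("N-{P}-[ST]-{P}"))
```

Working element by element keeps each rule in the table as a single line of code, which makes it easy to see which rows are covered.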
Profiles
Similar to scoring matrices used in sequence comparison
The outcome of applying the matrices is a score
A threshold is used to determine whether it is a hit
           1 2 3 4 5 6 7 8
Sp1        C H I Q G C G K VYGKTSHLRAHLRWH
Sp2        C H I P D C G K TFRKTSLLRAHVRLH
Sp3        C H I P G C G K VYGKTSHLRAHLRWH
Sp4        C H I E G C G K VYGKTSHLRAHLRWH
DrosBtd    C H I P G C E R LYGKASHLKTHLRWH
DrosSp     C H I P G C G K VYGKTSHLKAHLRWH
CeT22C8.5  C S V P G C G K TYKKTSHLRAHLRKH
Y40B1A.4   C H L - - C N K TYGKTSHLRAHLRGH

Profile
Pos  A C D E F G H I K L M N P Q R S T V W X –
1    0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2    0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 1 0 0 0 0 0
3    0 0 0 0 0 0 0 6 0 1 0 0 0 0 0 0 0 1 0 0 0
4    0 0 0 1 0 0 0 0 0 0 0 0 5 1 0 0 0 0 0 0 1
5    0 0 1 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
6    0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7    0 0 0 1 0 6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
8    0 0 0 0 0 0 0 0 7 0 0 0 0 0 1 0 0 0 0 0 0
Score for seq C H I Q G C G K: 8 + 7 + 6 + 1 + 6 + 8 + 6 + 7 = 49
Sp1        CHIQGCGK = 8+7+6+1+6+8+6+7 = 49
Sp2        CHIPDCGK = 8+7+6+5+1+8+6+7 = 48
Sp3        CHIPGCGK = 8+7+6+5+6+8+6+7 = 53
Sp4        CHIEGCGK = 8+7+6+1+6+8+6+7 = 49
DrosBtd    CHIPGCER = 8+7+6+5+6+8+1+1 = 42
DrosSp     CHIPGCGK = 8+7+6+5+6+8+6+7 = 53
CeT22C8.5  CSVPGCGK = 8+1+1+5+6+8+6+7 = 42
Y40B1A.4   CHL--CNK = 8+7+1+1+1+8+1+7 = 34 <- lowest
Since all the sequences are known to contain the zinc finger motif, the threshold can be set at 34.
Thus any sequence scoring below the threshold will be rejected, and any sequence scoring at or above it is likely to have the zinc finger motif.
Example
Unrelated seq – CADEGCEK – 8+0+0+1+6+8+1+7 = 31 REJECT
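The count-based scoring above can be reproduced with a short sketch. The function names `build_profile` and `score` are chosen for this illustration. The profile is simply the per-column residue count of the first eight alignment positions, and the threshold is taken as the lowest score among the known members, as on the slide.

```python
from collections import Counter

# First eight alignment columns of the zinc finger sequences from the slides.
ALIGNED = {
    "Sp1": "CHIQGCGK",
    "Sp2": "CHIPDCGK",
    "Sp3": "CHIPGCGK",
    "Sp4": "CHIEGCGK",
    "DrosBtd": "CHIPGCER",
    "DrosSp": "CHIPGCGK",
    "CeT22C8.5": "CSVPGCGK",
    "Y40B1A.4": "CHL--CNK",
}

def build_profile(sequences):
    """Count how often each residue (or gap) occurs in every column."""
    return [Counter(column) for column in zip(*sequences)]

def score(profile, seq):
    """Sum the column counts for the residues in seq; unseen residues add 0."""
    return sum(column[residue] for column, residue in zip(profile, seq))

profile = build_profile(ALIGNED.values())
# Lowest score among sequences known to contain the motif.
threshold = min(score(profile, seq) for seq in ALIGNED.values())

for name, seq in list(ALIGNED.items()) + [("Unrelated", "CADEGCEK")]:
    s = score(profile, seq)
    print(name, seq, s, "ACCEPT" if s >= threshold else "REJECT")
```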
The unrelated sequence was rejected due to a low score.
However, if one were using a Prosite pattern, one would have accepted it.
– C-x(2,4)-C-x(2) <= Prosite motif
Advantage of profile
– More expressive, details are included
– More sensitive
– Provides a quantitative value
Example provided is very simple
It is possible to include
– Evolutionary distance
– Amino acid frequency
– Substitution matrix
This makes the profile even more accurate
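One common way to bring amino acid frequency into a profile is to score each column as a log-odds ratio against a background frequency. The sketch below is an assumption-laden illustration rather than the method used in the slides: it assumes a uniform background over 20 amino acids plus the gap symbol, adds a pseudocount of 1 so that unseen residues get a finite negative score, and the name `log_odds_profile` is made up.

```python
import math
from collections import Counter

ALIGNED = ["CHIQGCGK", "CHIPDCGK", "CHIPGCGK", "CHIEGCGK",
           "CHIPGCER", "CHIPGCGK", "CSVPGCGK", "CHL--CNK"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumption: 20 amino acids plus gap

def log_odds_profile(sequences, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2(observed / background)."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequency
    total = len(sequences) + pseudocount * len(ALPHABET)
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        profile.append({
            residue: math.log2((counts[residue] + pseudocount) / total / background)
            for residue in ALPHABET
        })
    return profile

def score(profile, seq):
    """Sum the per-column log-odds scores of seq."""
    return sum(column[residue] for column, residue in zip(profile, seq))

profile = log_odds_profile(ALIGNED)
print(round(score(profile, "CHIPGCGK"), 2))  # a known member
print(round(score(profile, "CADEGCEK"), 2))  # the unrelated sequence
```

With real data the background would normally come from observed amino acid frequencies rather than a uniform assumption, and a substitution matrix could replace the flat pseudocount.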
Hidden Markov Models (HMM)
Profiles are a special case of HMMs
HMMs have a number of states
Transitions from one state to another are based on a set of probabilities called transition probabilities
At each state an observation is generated
It is called a hidden Markov model because only the observations are visible; the states are hidden.
The probabilities are first determined using MSA.
The determined probabilities are then used to determine whether a sequence has the pattern or not.
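To make the idea of states, transition probabilities and emitted observations concrete, here is a toy two-state HMM scored with the forward algorithm, which sums over all hidden state paths to give the probability of an observed sequence. This is not a profile HMM, and every state name and probability below is invented for illustration; in practice the probabilities would be estimated from an MSA.

```python
from itertools import product

# Hypothetical model: a "motif" state rich in C and H, and a "background" state.
STATES = ("background", "motif")
START = {"background": 0.9, "motif": 0.1}
TRANS = {
    "background": {"background": 0.9, "motif": 0.1},
    "motif": {"background": 0.2, "motif": 0.8},
}
# Reduced alphabet for brevity: C, H, and "o" for any other residue.
EMIT = {
    "background": {"C": 0.05, "H": 0.05, "o": 0.90},
    "motif": {"C": 0.45, "H": 0.45, "o": 0.10},
}

def forward(observations):
    """Probability of the observations, summed over all hidden state paths."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

print(forward("CHC"), forward("ooo"))
# Sanity check: the probabilities of all length-3 sequences should sum to 1.
print(sum(forward("".join(obs)) for obs in product("CHo", repeat=3)))
```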
[Figure: A Short Profile HMM, with insertion (I1), match (M1, M2) and deletion (D2) state nodes]
I represents insertion states, M represents match states and D represents deletion states.
Both I and M states emit amino acids.
Sources and Creation of Patterns
Source of patterns
– The source of patterns is mainly MSA.
Creation of patterns
– Manually as in Prosite
– Automatically through machine learning
• MEME
• Pratt
Considerations
Sensitivity/Recall
– How many of the true patterns were discovered
– TP / (TP + FN)
Specificity/Precision
– How many of the discovered patterns are correct
– TP / (TP + FP)
It is usually a balance between these two measures.
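The trade-off can be seen by moving a score threshold. In the sketch below the member scores and the score of 31 for the unrelated sequence come from the profile example; the two other non-member scores (36 and 20) are hypothetical, as is the function name `evaluate`.

```python
# (score, is_true_member) pairs.
SCORED = [
    (49, True), (48, True), (53, True), (49, True),
    (42, True), (53, True), (42, True), (34, True),
    (31, False),   # unrelated sequence from the profile example
    (36, False),   # hypothetical non-member
    (20, False),   # hypothetical non-member
]

def evaluate(scored, threshold):
    """Return (sensitivity, precision) when accepting scores >= threshold."""
    tp = sum(1 for s, member in scored if s >= threshold and member)
    fp = sum(1 for s, member in scored if s >= threshold and not member)
    fn = sum(1 for s, member in scored if s < threshold and member)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, precision

for threshold in (34, 43):
    print(threshold, evaluate(SCORED, threshold))
```

With these numbers, the lower threshold keeps every member but lets one non-member through, while the higher threshold removes the false positive at the cost of missing three members.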
[Figure: Ideal situation, with the Threshold marked]
[Figure: The real situation, with the Threshold, False Positive and False Negative regions marked]
Other points:
– A literature search can be done to identify potential conserved/functional regions suitable for use in pattern creation.
• For example, Alanine Scanning may indicate a region of functional importance.
– All calculations of Sensitivity and Specificity are based on the current state of the database.
– Need to consider the coverage of the existing database.
Summary
Definition of patterns and motifs
Why use pattern finding
Types of patterns
Sources and Creation of Patterns