Motif Search
![Page 1: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/1.jpg)
Motif Search
![Page 2: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/2.jpg)
What are Motifs
• Motif (dictionary): a recurrent thematic element, a common theme
![Page 3: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/3.jpg)
Find a common motif in the text
![Page 4: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/4.jpg)
Find a short common motif in the text
![Page 5: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/5.jpg)
Motifs in biological sequences
A sequence motif is a short common sequence (length 4-20) that is highly represented in the data
![Page 6: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/6.jpg)
Challenges in biological sequences
Motifs are usually not exact words
![Page 7: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/7.jpg)
How to represent non-exact motifs?
• Consensus string NTAHAWT
May allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; H = not G; S = C/G; R = A/G; Y = T/C, etc.
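• Position Weight Matrix (PWM)
Probability for each base in each position

| Base | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | 0.1 | 0.7 | 0.2 | 0.6 | 0.5 | 0.1 |
| T | 0.7 | 0.1 | 0.5 | 0.2 | 0.2 | 0.8 |
| G | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 |
| C | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |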
![Page 8: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/8.jpg)
Motifs in biological sequences
– Regulatory motifs in DNA (transcription factor binding sites)
– Functional sites in proteins (e.g., phosphorylation sites)
What can we learn from these motifs?
![Page 9: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/9.jpg)
DNA Regulatory Motifs
• Transcription Factors (TFs) are regulatory proteins that bind to regulatory motifs near the gene and act as an on/off switch
– TF binding motifs are usually 6-20 nucleotides long
– They are located near the target gene, mostly upstream of the transcription start site
[Figure: TF1 and TF2 bound to the TF1 motif and TF2 motif upstream of the transcription start site of Gene X]
![Page 10: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/10.jpg)
Can we find TF targets using a bioinformatics approach?
![Page 11: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/11.jpg)
p53 is a transcription factor involved in most human cancers
We are interested in identifying the genes regulated by p53
![Page 12: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/12.jpg)
Finding TF targets using a bioinformatics approach?
Scenario 1 : Binding motif is known (easier case)
Scenario 2 : Binding motif is unknown (hard case)
![Page 13: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/13.jpg)
Scenario 1 : Binding motif is known
• Given a motif (e.g., consensus string, or weight matrix), find the binding sites in an input sequence
![Page 14: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/14.jpg)
Given a consensus:
For each position l in the input sequence, check if the substring starting at position l matches the motif.
Example: find the consensus motif NTAHAWT in the promoter of a gene
>promoter of gene
AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA…….
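The consensus scan described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function name `find_consensus` and the `IUPAC` table are introduced here for the example, the table uses the standard IUPAC degenerate nucleotide codes, and only the visible part of the promoter (before the ellipsis) is searched.

```python
# Standard IUPAC degenerate nucleotide codes (each symbol -> allowed bases).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_consensus(seq, consensus):
    """Return 0-based start positions where `consensus` matches `seq`."""
    width = len(consensus)
    hits = []
    for start in range(len(seq) - width + 1):
        # A window matches if every base is allowed by the symbol at that position.
        if all(seq[start + k] in IUPAC[sym] for k, sym in enumerate(consensus)):
            hits.append(start)
    return hits

# Visible part of the promoter from the slide (the original continues past this).
PROMOTER = "AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA"

if __name__ == "__main__":
    for start in find_consensus(PROMOTER, "NTAHAWT"):
        print(start, PROMOTER[start:start + 7])
```

On the visible promoter fragment this reports two matches, GTATATT at offset 5 and ATAAATT at offset 38 (0-based).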
![Page 15: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/15.jpg)
Given a Position Weight Matrix (PWM):
Starting from a set of aligned motifs:
Seq 1 AAAAGCCC
Seq 2 CTAATCCA
Seq 3 CTAATCCC
Seq 4 CTAATCCC
Seq 5 GTAATCCC
Seq 6 CTAATCCC
Seq 7 CTAATCCC
Seq 8 CTAATCCC
Seq 9 TTAATCTG
![Page 16: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/16.jpg)
Given a Position Weight Matrix (PWM):
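Counts of each base in each column:

| Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 9 | 9 | 0 | 0 | 0 | 1 |
| C | 6 | 0 | 0 | 0 | 0 | 9 | 8 | 7 |
| G | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| T | 1 | 8 | 0 | 0 | 8 | 0 | 1 | 0 |

Probability of each base in each column (W):

| Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | .11 | .11 | 1 | 1 | 0 | 0 | 0 | .11 |
| C | .67 | 0 | 0 | 0 | 0 | 1 | .89 | .78 |
| G | .11 | 0 | 0 | 0 | .11 | 0 | 0 | .11 |
| T | .11 | .89 | 0 | 0 | .89 | 0 | .11 | 0 |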
$W_{\beta,k}$ = probability of base $\beta$ in column $k$
• Given a string $s$ of length $l = 8$
• $s = s_1 s_2 \dots s_l$
• $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k,k}$
• Example: $\Pr(\text{CTAATCCG}) = 0.67 \times 0.89 \times 1 \times 1 \times 0.89 \times 1 \times 0.89 \times 0.11$
Given a Position Weight Matrix (PWM)
• Given sequence S (e.g., 1000 base-pairs long)
• For each substring s of S,
– Compute Pr(s|W)
– If Pr(s|W) > some threshold, call that a binding site
• In DNA sequences we need to search both strands:
AGTTACACCA
TGGTGTAACT (reverse complement)
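A sliding-window scan over both strands might look like the sketch below. It assumes a PWM stored as base -> list of column probabilities (here taken from the counts of the earlier slide, divided by 9), and the threshold value is an arbitrary choice for illustration; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def window_probability(pwm, window):
    """Pr(window | W) as a product of per-column base probabilities."""
    p = 1.0
    for k, base in enumerate(window):
        p *= pwm[base][k]
    return p

def scan(seq, pwm, threshold):
    """Return (start, strand, site, probability) for every window above threshold."""
    width = len(pwm["A"])
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            site = s[i:i + width]
            p = window_probability(pwm, site)
            if p > threshold:
                # Report minus-strand hits in forward-strand coordinates.
                start = i if strand == "+" else len(seq) - width - i
                hits.append((start, strand, site, p))
    return hits

# PWM from the earlier count matrix (9 aligned sequences, 8 columns).
PWM = {
    "A": [c / 9 for c in (1, 1, 9, 9, 0, 0, 0, 1)],
    "C": [c / 9 for c in (6, 0, 0, 0, 0, 9, 8, 7)],
    "G": [c / 9 for c in (1, 0, 0, 0, 1, 0, 0, 1)],
    "T": [c / 9 for c in (1, 8, 0, 0, 8, 0, 1, 0)],
}

if __name__ == "__main__":
    print(reverse_complement("AGTTACACCA"))
    print(scan("GGCTAATCCCGG", PWM, threshold=0.01))
```

In practice a log-odds score against a background model is often used instead of the raw probability, which avoids numerical underflow for long motifs; that refinement is outside what the slide describes.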
![Page 18: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/18.jpg)
Scenario 2 : Binding motif is unknown
“Ab initio motif finding”
![Page 19: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/19.jpg)
Ab initio motif finding: Expectation Maximization
• Local search algorithm
– Start from a random PWM
– Move from one PWM to another so as to improve the score that measures how well the motif fits the sequence
– Keep doing this until no more improvement is obtained: convergence to a local optimum
![Page 20: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/20.jpg)
Expectation Maximization
• Let W be a PWM. Let S be the input sequence.
• Imagine a process that randomly searches, picks different strings matching W and threads them together into a new PWM
![Page 21: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/21.jpg)
Expectation Maximization
• Find W so as to maximize Pr(S|W)
• The “Expectation-Maximization” (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)
![Page 22: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/22.jpg)
Expectation Maximization
1. Start from a random motif (PWM)
2. Scan the sequence for good matches to the current motif
3. Build a new PWM out of these matches, and make it the new motif
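The three steps above can be sketched as a small EM loop. This is a simplified illustration under stated assumptions, not the algorithm of any specific tool: it assumes exactly one motif occurrence per sequence, a uniform background, a fixed motif width, sequences containing only A/C/G/T, and a small pseudocount to keep probabilities non-zero. All function names and default values are hypothetical.

```python
import random

BASES = "ACGT"

def random_pwm(width, rng):
    """Step 1: a random starting PWM, one probability column per position."""
    pwm = []
    for _ in range(width):
        raw = [rng.random() + 0.1 for _ in BASES]
        total = sum(raw)
        pwm.append({b: x / total for b, x in zip(BASES, raw)})
    return pwm

def window_probability(pwm, window):
    """Pr(window | W): product of per-position base probabilities."""
    p = 1.0
    for column, base in zip(pwm, window):
        p *= column[base]
    return p

def em_step(pwm, sequences, pseudocount=0.1):
    """One iteration: weight all windows by the current PWM, then rebuild the PWM."""
    width = len(pwm)
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq in sequences:
        starts = range(len(seq) - width + 1)
        # Step 2 (E-step): how well does each window match the current motif?
        weights = [window_probability(pwm, seq[j:j + width]) for j in starts]
        total = sum(weights)
        # Step 3 (M-step): add each window's bases, scaled by its normalised weight.
        for j, w in zip(starts, weights):
            for k in range(width):
                counts[k][seq[j + k]] += w / total
    new_pwm = []
    for column in counts:
        column_total = sum(column.values())
        new_pwm.append({b: c / column_total for b, c in column.items()})
    return new_pwm

def find_motif(sequences, width, max_iterations=200, tol=1e-6, seed=0):
    """Iterate until the PWM stops changing (a local optimum) or iterations run out."""
    rng = random.Random(seed)
    pwm = random_pwm(width, rng)
    for _ in range(max_iterations):
        new_pwm = em_step(pwm, sequences)
        change = max(abs(new_pwm[k][b] - pwm[k][b])
                     for k in range(width) for b in BASES)
        pwm = new_pwm
        if change < tol:
            break
    return pwm
```

Because the search is local, different seeds can converge to different PWMs; running from several random starts and keeping the best-scoring result is a common way to reduce that sensitivity.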
![Page 23: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/23.jpg)
The final PWM represents the motif that is most enriched in the data
The PWM can also be represented as a sequence logo:
- A letter’s height indicates the information it contains
- The top letter at each position can be read to obtain the consensus sequence (motif)
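The quantities drawn in a sequence logo can be computed directly from a PWM. The sketch below uses the common definition of column information for DNA, 2 bits minus the column's Shannon entropy, and ignores the small-sample correction that some logo tools apply; the function names are illustrative.

```python
import math

def column_information(column):
    """Information content in bits of one DNA column: log2(4) minus its entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(4) - entropy

def letter_heights(column):
    """Height of each letter in the logo: its probability times the column information."""
    info = column_information(column)
    return {base: p * info for base, p in column.items()}

def consensus(pwm_columns):
    """Read the most probable (top) letter at each position."""
    return "".join(max(col, key=col.get) for col in pwm_columns)

# Columns of the PWM built earlier from 9 aligned sequences (counts / 9).
COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}
PWM_COLUMNS = [{b: COUNTS[b][k] / 9 for b in "ACGT"} for k in range(8)]

if __name__ == "__main__":
    print(consensus(PWM_COLUMNS))
    print([round(column_information(col), 2) for col in PWM_COLUMNS])
```

A fully conserved column (such as positions 3 and 4 here) carries the maximum of 2 bits, while a column with all four bases equally likely carries 0 bits and would be invisible in the logo.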
![Page 24: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/24.jpg)
Are common motifs the right thing to search for?
![Page 25: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/25.jpg)
?
![Page 26: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/26.jpg)
Solutions:
- Search for motifs that are enriched in one set but not in a random set
- Use experimental information to rank the sequences according to their binding affinity, and search for motifs enriched at the top of the list
![Page 27: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/27.jpg)
Searching for enriched motifs in a ranked list
[Figure: sequences ranked 1, 2, 3, 4, ... by binding affinity]
Hypergeometric (HG) distribution test
k = number of motifs in the top of the list
m = number of sequences in the top of the list
n = total number of motifs found
N = total number of sequences
The P-value reflects the surprise of seeing the observed density of motif occurrences at the top of the list compared to the rest of the list.
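The hypergeometric tail probability for these four quantities can be computed as below. This is a sketch under a simplifying assumption that is not stated on the slide: each sequence either contains the motif or does not, so n counts motif-containing sequences and k counts those among the top m. The function name `hg_pvalue` is illustrative.

```python
from math import comb

def hg_pvalue(N, n, m, k):
    """P(X >= k) when m sequences are drawn from N, of which n contain the motif."""
    total = comb(N, m)
    # Sum the probability of seeing i motif-containing sequences for all i >= k.
    tail = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, min(n, m) + 1))
    return tail / total

if __name__ == "__main__":
    # Hypothetical numbers: 100 sequences, 20 with the motif, 12 of them in the top 15.
    print(hg_pvalue(N=100, n=20, m=15, k=12))
```

A small value means that finding k or more motif occurrences among the top m sequences would be unlikely if occurrences were scattered at random through the list.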
![Page 28: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/28.jpg)
Searching for enriched motifs in ranked list
[Figure: sequences ranked 1, 2, 3, 4, ... by binding affinity]
k = number of motifs in the top of the list
m = number of sequences in the top of the list
n = total number of motifs found
N = total number of sequences
Choosing the best way to cut the list (minimal HG score)
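Choosing the cut can be sketched as trying every possible top-of-list size m and keeping the one with the smallest HG tail probability. The sketch assumes the ranked list is given as 0/1 flags (1 if the sequence at that rank contains the motif), which is a simplification; `best_cutoff` is a hypothetical name. The minimum found this way is a score rather than a corrected P-value, since many cutoffs were tried.

```python
from math import comb

def hg_pvalue(N, n, m, k):
    """P(X >= k) for a hypergeometric draw of m from N with n successes."""
    tail = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, min(n, m) + 1))
    return tail / comb(N, m)

def best_cutoff(has_motif):
    """Return (minimal HG score, cutoff m) over all prefixes of the ranked list."""
    N = len(has_motif)
    n = sum(has_motif)
    best_score, best_m = 1.0, 0
    k = 0
    for m in range(1, N + 1):
        k += has_motif[m - 1]          # motif occurrences among the top m
        score = hg_pvalue(N, n, m, k)
        if score < best_score:
            best_score, best_m = score, m
    return best_score, best_m

if __name__ == "__main__":
    # Hypothetical ranked list: motif present mostly in the high-affinity sequences.
    ranked = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(best_cutoff(ranked))
```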
![Page 29: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/29.jpg)
Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity
>affinity = 5.962
ACAAAAGCGUGAACACUUCCACAUGAAAUUCGUUUUUUGUCCUUUUUUUUCUCUUCUUUUUCUCUCCUGUUUCU
>affinity = 5.937
AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACGCCCGAAAGUUGGACAUUUUAAAUUUUAAUUCUCAUGA
>affinity = 5.763
UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUCUUUUUUUAAAAAUAAAAAAAGAGGAGAAAAAUGC
>affinity = 5.498
GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUCAUAUAUACGAUACAAAAAUAACA...
http://drimust.technion.ac.il/
![Page 30: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/30.jpg)
Protein Motifs
Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile:
P[ED]XK[RW][RK]X[ED]
or as a PWM
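A consensus of this form maps almost directly onto a regular expression: bracketed groups already mean "one of these residues", and X (any amino acid) becomes a wildcard. The sketch below uses Python's `re` module; the function names and the test protein are made up for illustration.

```python
import re

def pattern_to_regex(pattern):
    """Translate a consensus like P[ED]XK[RW][RK]X[ED] into a regex string."""
    # X stands for any amino acid; '.' matches any single character.
    return pattern.replace("X", ".")

def find_protein_motif(pattern, protein):
    """Return (0-based start, matched site) for every match, including overlaps."""
    # The lookahead lets matches overlap, since it consumes no characters.
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein)]

if __name__ == "__main__":
    # Hypothetical protein fragment containing one site.
    print(find_protein_motif("P[ED]XK[RW][RK]X[ED]", "MAPEAKRKGDLL"))
```

Note that `.` also matches characters that are not amino acid letters, so input sequences should be validated separately if that matters.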
![Page 31: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/31.jpg)
Protein Domains
• In addition to short motifs, proteins are characterized by domains.
• Domains are long motifs (30-100 aa) and are considered the building blocks of proteins (evolutionary modules).
The zinc-finger domain
![Page 32: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/32.jpg)
Some domains can be found in many proteins with different functions:
![Page 33: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/33.jpg)
... while other domains are found only in proteins with a certain function
MBD = Methylated DNA Binding Domain
![Page 34: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/34.jpg)
Varieties of protein domains
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
![Page 35: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/35.jpg)
Pfam
A database that contains a large collection of multiple sequence alignments of protein domains
Based on profile hidden Markov models (HMMs).
Compared with a PWM, an HMM is a model that also considers dependencies between the different columns in the matrix (different residues), for example position-specific insertions and deletions, and is thus much more powerful.
http://pfam.sanger.ac.uk/
![Page 36: Motif Search](https://reader035.fdocuments.us/reader035/viewer/2022062221/56813336550346895d9a2f7e/html5/thumbnails/36.jpg)
A profile HMM (hidden Markov model) can accurately represent a multiple sequence alignment (MSA)
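[Figure: profile HMM over alignment columns 16-19, with match states M16-M19, insert states I16-I19 and delete states D16-D19; transitions are labelled 100% or 50%, and insert-state emissions are shown as X]

Match-state emission probabilities:

| State | Emissions |
|---|---|
| M16 | D 0.8, S 0.2 |
| M17 | P 0.4, R 0.6 |
| M18 | T 1.0 |
| M19 | R 0.4, S 0.6 |

Multiple sequence alignment:

| 16 | 17 | 18 | 19 |
|---|---|---|---|
| D | R | T | R |
| D | R | T | S |
| S | - | - | S |
| S | P | T | R |
| D | R | T | R |
| D | P | T | S |
| D | - | - | S |
| D | - | - | S |
| D | - | - | S |
| D | - | - | R |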
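The emission probabilities and the 50% transitions in this example can be recovered from the alignment by simple counting. The sketch below is only that counting step, under the assumption that every alignment column is a match column; it is not how Pfam or any HMM package builds its models, which typically add pseudocounts and sequence weighting. The names `ALIGNMENT` and `match_emissions` are illustrative.

```python
from collections import Counter

# The ten aligned rows from the slide, columns 16-19 ('-' marks a deletion).
ALIGNMENT = [
    "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
    "DPTS", "D--S", "D--S", "D--S", "D--R",
]

def match_emissions(alignment):
    """For each column, return (residue frequencies among non-gaps, gap fraction)."""
    columns = []
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]
        counts = Counter(residues)
        emissions = {aa: count / len(residues) for aa, count in counts.items()}
        gap_fraction = 1 - len(residues) / len(col)
        columns.append((emissions, gap_fraction))
    return columns

if __name__ == "__main__":
    for position, (emissions, gaps) in zip(range(16, 20), match_emissions(ALIGNMENT)):
        print(position, emissions, "gap fraction:", gaps)
```

Half of the rows have gaps in columns 17 and 18, which corresponds to the 50% split between entering the match state and entering the delete state after column 16.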