MOTIFS MOTIFS MOTIFSMARTIFAMORIFSMOOTIFSMICIFC A sequence motif is a nucleotide or amino-acid...


Page 1:

MOTIFS

Page 2:

A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread (repeated) and has or is conjectured to have a biological significance. Sequence motifs may be identical to each other or they may vary to a greater or lesser extent.

Page 3:

Domains, Patterns, Motifs, Repeats?

Page 4:

For proteins, a sequence motif is distinguished from a structural motif, i.e., a motif formed by the three dimensional arrangement of amino acids, which may not be adjacent.

Example: N-glycosylation site motif

Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro.

Page 5:

When a sequence motif appears in protein-coding regions, it may specify a "structural motif" of a protein. Short coding motifs in proteins include sites that label proteins for delivery to particular parts of a cell, or mark them for phosphorylation.

Noncoding sequences contain functional (i.e., regulatory) sequence motifs and motifs that are just "junk," such as satellite DNA.

Functional motifs in DNA play different roles, such as binding sites for proteins.

The discipline of bioinformatics concerns itself with the finding and the sequence characterization of motifs through computer-based techniques of sequence analysis.

Page 6:

Motif notation

Consider the N-glycosylation site motif:

Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro.

This pattern may be written as:

N{P}[ST]{P}

where N = Asn, P = Pro, S = Ser, T = Thr; {X} means any amino acid except X; and [XY] means either X or Y. The notation [XY] does not give any indication of the probability of X or Y occurring in the pattern.
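As a rough illustration, the pattern N{P}[ST]{P} can be translated by hand into a regular expression and scanned against a protein sequence. The sketch below uses Python's re module; the function name and the example sequence are invented for illustration and do not come from a real protein.

```python
import re

# N{P}[ST]{P}: Asn, then anything but Pro, then Ser or Thr, then anything but Pro.
# The lookahead (?=...) lets overlapping candidate sites be reported.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(protein):
    """Return (0-based start, matched 4-mer) for every candidate site."""
    return [(m.start(), m.group(1)) for m in N_GLYC.finditer(protein.upper())]

# Hypothetical sequence used only to exercise the pattern.
example = "MKNGSALNPTQNNTSG"
sites = find_n_glycosylation_sites(example)
print(sites)
```

The site starting at "NPTQ" is rejected because Pro follows Asn, which is exactly what {P} encodes. A match only marks a candidate site; the pattern says nothing about whether the site is actually glycosylated.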

Page 7:

Identifying motifs: The challenge

• A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed

– How can one gene have such drastic effects?

Page 8:

Identifying motifs: The challenge

• Gene X encodes a regulatory protein, such as a transcription factor (TF)

• The 20 unexpressed genes rely on the gene product (TF) to induce transcription

• A single TF may regulate multiple genes

Page 9:

Identifying motifs: The challenge

• Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site

• Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

• TFs influence gene expression by binding to their specific TFBS within the gene's regulatory region

Page 10:

Identifying motifs: The challenge

• A motif can be located anywhere within the regulatory region.

• Motifs may vary across different regulatory regions.

Page 11:

Motifs and Transcriptional Start Sites

[Figure: five genes, each with a motif instance located upstream of its transcriptional start site. The five instances are ATCCCG, TTCCGG, ATCCCG, ATGCCG, and ATGCCC.]

Page 12:

Why is finding motifs difficult?

Step 1: Start with random sequences

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Page 13:

Why is finding motifs difficult?

Step 2: Implant motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Page 14:

Where is the implanted motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
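When the motif is implanted without mutations, it can be recovered simply by looking for a k-mer that occurs in every sequence. The sketch below shows that idea on three short toy sequences built from flanking fragments plus the 15-mer; the function name and the shortened flanks are assumptions made for illustration.

```python
def shared_kmers(seqs, k):
    """Return the set of k-mers that occur in every sequence."""
    common = None
    for s in seqs:
        kmers = {s[i:i + k] for i in range(len(s) - k + 1)}
        common = kmers if common is None else common & kmers
    return common if common is not None else set()

# Toy sequences: short flanks around an exact copy of the implanted 15-mer.
motif = "a" * 8 + "g" * 7
seqs = [
    "gatactgat" + motif + "ggcgtaca",
    "ccgaata" + motif + "a",
    "gatgactt" + motif + "tgctctcc",
]
found = shared_kmers(seqs, 15)
print(found)
```

This exact-match approach stops working as soon as each implanted copy carries mutations, which is the situation the following slides turn to.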

Page 15:

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
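The implanting step can be sketched in a few lines: pick a random position in a random background sequence and overwrite it with a copy of the motif in which exactly d positions have been substituted. This is only a minimal sketch; the function names, the background length of 80, and the fixed random seed are assumptions, and the slides do not say how their sequences were generated.

```python
import random

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions substituted.

    Mutated bases are written in lowercase, as on the slides.
    """
    out = list(motif)
    for p in rng.sample(range(len(motif)), d):
        out[p] = rng.choice([b for b in "acgt" if b != out[p].lower()])
    return "".join(out)

def implant(background, motif, d, rng):
    """Overwrite a random window of background with a mutated motif copy."""
    start = rng.randrange(len(background) - len(motif) + 1)
    instance = mutate(motif, d, rng)
    return background[:start] + instance + background[start + len(motif):], start

rng = random.Random(0)  # fixed seed so the sketch is repeatable
motif = "AAAAAAAAGGGGGGG"
background = "".join(rng.choice("acgt") for _ in range(80))
seq, start = implant(background, motif, 4, rng)
instance = seq[start:start + len(motif)]
print(seq)
print(instance)
```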

Page 16:

Why is Finding a (15,4) Motif Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
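The difficulty can be made concrete with a Hamming distance: each instance is only 4 substitutions away from the implanted motif, yet two instances can be up to 8 substitutions away from each other. The sketch below compares the two instances shown above; the function name is an assumption.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x.lower() != y.lower() for x, y in zip(a, b))

implanted = "AAAAAAAAGGGGGGG"
first = "AgAAgAAAGGttGGG"
second = "cAAtAAAAcGGcGGG"

print(hamming(implanted, first))   # distance of the first instance from the motif
print(hamming(implanted, second))  # distance of the second instance from the motif
print(hamming(first, second))      # distance between the two instances
```

Here the two instances differ from each other at 7 positions rather than 8, because one mutated position happens to be shared. Either way, the instances resemble each other far less than each resembles the motif, so pairwise comparison of sequences gives a weak signal.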

Page 17:

Page 18:

Discovery of Motifs 1. Consensus sequences

The notation [XYZ] means X or Y or Z, but does not indicate the likelihood of any particular match. For this reason, two or more patterns are often associated with a single motif. It is sometimes advisable to look at consensus sequences and refine the definition of a motif.

Page 19:

Discovery of Motifs 1. Consensus sequences

Rigorously, the IQ motif is:

[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]

where x = any amino acid, and the square brackets indicate alternatives.

Usually, the first amino acid is I, the two [RK] choices are R, and xx[FILVWY] is so loosely defined that it can be ignored. Thus, the consensus is:

IQxxxRGxxxR
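The difference between the rigorous pattern and the consensus can be seen by writing both as regular expressions. The sketch below is a hand translation into Python's re syntax, with x written as "."; the two peptide strings are invented for illustration.

```python
import re

# Rigorous IQ motif: [FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]
IQ_FULL = re.compile(r"[FILV]Q.{3}[RK]G.{3}[RK].{2}[FILVWY]")
# Consensus: IQxxxRGxxxR
IQ_CONSENSUS = re.compile(r"IQ.{3}RG.{3}R")

# Hypothetical peptides, not taken from real proteins.
typical = "AAIQAAARGAAARAAFAA"   # fits both the rigorous pattern and the consensus
atypical = "LQSSSKGTTTKPPW"      # fits only the rigorous pattern

for peptide in (typical, atypical):
    print(peptide, bool(IQ_FULL.search(peptide)), bool(IQ_CONSENSUS.search(peptide)))
```

The consensus is easier to read but misses instances that use the less common alternatives, which is the trade-off described above.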

Page 20:

Discovery of Motifs 2. Discovery through evolutionary conservation

Motifs may be discovered by comparing homologous genes from different species. For example, by aligning the amino acid sequences specified by the GCM (glial cells missing) gene in man, mouse and D. melanogaster, a pattern was discovered (the GCM motif) that spans about 150 amino acids, and begins as follows:

WDIND*.*P..*...D.F.*W***.**.IYS**...A.*H*S*WAMRNTNNHN

Here each . signifies a single amino acid or a gap, and each * indicates one member of a closely-related amino-acid family.

Subsequently, it was shown that the motif has DNA binding activity.

Page 21:

Motif Logo

• Motifs can mutate at unimportant (non-conserved) bases

• The five motifs in five different genes have mutations in positions 3 and 5

• Representations called motif logos illustrate the conserved and variable regions of a motif

TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
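A logo is drawn from per-position base counts. The sketch below tallies each column of the five 7-base instances on this slide, lists the positions that vary, and reads off the most common base per position; the function and variable names are assumptions.

```python
from collections import Counter

# The five 7-base motif instances from the slide.
instances = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]

def column_counts(instances):
    """Return one Counter of base counts per alignment column."""
    return [Counter(column) for column in zip(*instances)]

counts = column_counts(instances)
# 1-based positions where more than one base is observed.
variable = [i + 1 for i, c in enumerate(counts) if len(c) > 1]
# Most common base at each position.
consensus = "".join(c.most_common(1)[0][0] for c in counts)

print(variable)
print(consensus)
```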

Page 22:

Motif Logos: an Example

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Page 23:

Measure of Conservation

• Relative heights of letters reflect their abundance in the alignment.

• Total height = entropy-based measurement of conservation.

• Entropy(i) = -SUM over all bases b of f(b, i) * ln[f(b, i)], where f(b, i) is the frequency of base b at position i

• Entropy measures variability/disorder.

– Highly conserved = low entropy = tall stack

– Highly variable = high entropy = low stack
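The entropy formula can be evaluated directly for one alignment column. The sketch below uses the natural logarithm as on the slide and skips bases with zero frequency (their contribution is taken as zero); the function name and the two example columns are assumptions.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy(i) = -sum over observed bases of f * ln(f) for one column."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())

conserved = "TTTTT"   # every instance has the same base
variable = "GAGAA"    # three A and two G

print(column_entropy(conserved))
print(round(column_entropy(variable), 3))
```

A fully conserved column has entropy 0, and a column with mixed bases has a positive entropy. Logos usually plot the maximum possible entropy minus the observed entropy, which is why low entropy corresponds to a tall stack.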

Page 24:

Identifying Motifs: Complications

• We do not know the motif sequence

• We do not know where it is located relative to some genomic landmark (say, gene start)

• Motifs can differ from one another

• The pattern may not be an exact sequence or an approximate sequence but something like “4-8 hydrophobic amino acids, followed by 2-3 leucines or isoleucines, followed by 2 phenylalanines and an aspartic acid, or 1 aspartic acid and two glycines.”

Page 25:

Discovery of Motifs 3. De novo computational discovery of motifs

Page 26:

A Motif Finding Analogy

• The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809–1849) in The Gold Bug

Page 27:

"The Gold-Bug" is a story of a man named William Legrand who seemingly goes mad after being bitten by a bug thought to be made of pure gold. He notifies his closest friend, the narrator, telling him to immediately come visit him at his home on Sullivan's Island in South Carolina. The two embark upon a search for lost treasure along with a servant named Jupiter. The narrator doubts Legrand’s sanity. However, after following several clues, they find a treasure buried by the infamous pirate "Captain Kidd," that is estimated to be worth about fourteen million dollars.

Among the clues, there is a secret message.

Page 28:

The Gold Bug Problem

• Given a secret message:

53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?34;48)4+;161;:188;+?;

• Decipher the message encrypted in the fragment

Page 29:

Hints for The Gold Bug Problem

• Additional hints:

– The encrypted message is in English

– Each symbol corresponds to one letter in the English alphabet

– No punctuation marks are encoded

Page 30:

The Gold Bug Problem: Symbol Counts

• Naive approach to solving the problem:

– Count the frequency of each symbol in the encrypted message

– Find the frequency of each letter in the alphabet in the English language

– Compare the frequencies of the previous steps, try to find a correlation and map the symbols to a letter in the alphabet
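The naive approach amounts to ranking symbols by frequency and pairing them with English letters ranked the same way. The sketch below does this for the opening fragment of the message; the function names are assumptions, and the English letter order is the commonly quoted most-to-least-frequent ranking.

```python
from collections import Counter

# English letters from most to least frequent.
ENGLISH_ORDER = "etaoinsrhldcumfpgwybvkxjqz"

def symbol_frequencies(message):
    """Return (symbol, count) pairs, most frequent first."""
    return Counter(message).most_common()

def naive_decode(message):
    """Map the i-th most frequent symbol to the i-th most frequent letter."""
    ranked = [symbol for symbol, _ in symbol_frequencies(message)]
    mapping = dict(zip(ranked, ENGLISH_ORDER))
    # Symbols beyond the 26th rank (if any) are shown as "?".
    return "".join(mapping.get(symbol, "?") for symbol in message)

# Opening fragment of the encrypted message.
fragment = "53++!305))6*;4826)4+.)4+);806*;48"
print(symbol_frequencies(fragment))
print(naive_decode(fragment))
```

On a short text the single-symbol ranking is unreliable, so the decoded output should not be expected to read as English.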

Page 31:

Symbol Frequencies in the Gold Bug Message

• Gold Bug Message:

Symbol     8   ;   4   )   +   *   5   6   (   !   1   0   2   9   3   :   ?   `   -   ]   .
Frequency  34  25  19  16  15  14  12  11  9   8   7   6   5   5   4   4   3   2   1   1   1

• English Language (most frequent to least frequent):

e t a o i n s r h l d c u m f p g w y b v k x j q z

Page 32:

The Gold Bug Message Decoding: First Attempt

• By simply mapping the most frequent symbols to the most frequent letters of the alphabet:

sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnltarhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorleoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfataeoaitdrdtpdeetiwt

• The result does not make sense

Page 33:

The Gold Bug Problem: l-tuple count

• A better approach:

– Examine frequencies of l-tuples, combinations of 2 symbols, 3 symbols, etc.

– “The” is the most frequent 3-tuple in English and “;48” is the most frequent 3-tuple in the encrypted text

– Make inferences of unknown symbols by examining other frequent l-tuples
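Counting l-tuples is a sliding-window tally over the message. The sketch below counts every window of length l, including overlapping ones; the function name is an assumption, and the fragment is the opening of the encrypted message, so its counts need not reflect the full message.

```python
from collections import Counter

def tuple_counts(text, l):
    """Count every length-l window (overlapping windows included)."""
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))

# Opening fragment of the encrypted message.
fragment = "53++!305))6*;4826)4+.)4+);806*;48"
print(tuple_counts(fragment, 3).most_common(5))
```

The same function applies unchanged to nucleotide sequences, where the windows are k-mers.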

Page 34:

The Gold Bug Problem: the ;48 clue

• Mapping “the” to “;48” and substituting all occurrences of the symbols:

53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!th6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3hthe)h+t161t:1eet+?t
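The substitution step can be sketched with a translation table that replaces only the symbols inferred so far and leaves the rest untouched. The function name is an assumption; the mapping is the “;48” to “the” inference above.

```python
def substitute(cipher, mapping):
    """Replace every symbol that has a known letter; keep the others as-is."""
    return cipher.translate(str.maketrans(mapping))

known = {";": "t", "4": "h", "8": "e"}

# Opening fragment of the encrypted message.
fragment = "53++!305))6*;4826)4+.)4+);806*;48"
partial = substitute(fragment, known)
print(partial)
```

As further symbols are inferred, they can be added to the mapping and the substitution rerun.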


Page 35:

The Gold Bug Message Decoding: Second Attempt

• Make inferences:

53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!th6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3hthe)h+t161t:1eet+?t

• “thet(ee” most likely means “the tree”

– Infer “(” = “r”

• “th(+?3h” becomes “thr+?3h”

– Can you guess “+”, “?”, and “3”? (Answer: “oug”, giving “through”.)

Page 36:

The Gold Bug Problem: The Solution

• The final message is:

AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGREESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENTHLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINEFROMTHETREETHROUGHTHESHOTFIFTYFEETOUT


Page 37:

The Solution (cont’d)

• Punctuation (akin to annotation) is important:

A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S SEAT, TWENTY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH, MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OF THE DEATH’S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT, FIFTY FEET OUT.

Page 38:

Solving the Gold Bug Problem

• Prerequisites to solve the problem:

– Need to know the relative frequencies of single letters, and combinations of two and three letters in English.

– Knowledge of all the words in the English dictionary is highly desirable.


Page 39:

Motif Finding and The Gold Bug Problem: Similarities

– Nucleotides in motifs encode a message in the “genetic” language. Symbols in “The Gold Bug” encode a message in English.

– In order to solve the problem, we analyze the frequencies of patterns in the DNA or in the Gold Bug message.

– Knowledge of established regulatory motifs makes the Motif Finding problem simpler. Knowledge of the words in the English dictionary helps to solve The Gold Bug problem.

Page 40:

Similarities (cont’d)

• Motif Finding:

– In order to solve the problem, we analyze the frequencies of patterns in the nucleotide sequences

• The Gold Bug Problem:

– In order to solve the problem, we analyze the frequencies of patterns in the text written in English

Page 41:

Similarities (cont’d)

• Motif Finding:

– Knowledge of established motifs reduces the complexity of the problem

• The Gold Bug Problem:

– Knowledge of the words in the dictionary is highly desirable

Page 42:

Motif Finding and The Gold Bug Problem: Differences

Motif Finding is harder than the Gold Bug problem:

– We don’t have the complete dictionary of motifs

– The “genetic” language does not have a standard “grammar”

– Only a small fraction of nucleotide sequences encode motifs; the size of the data is enormous

Page 43:

So, what do we do?

We use whatever knowledge we have, and teach the computer program to look for elements that abide by these rules.

Page 44:

• Similarity to something known

• Strand specificity (cis to the gene)

• Knowledge of length distribution

• May have known folds

• Taxonomic distribution

• Position specificity

Page 45:

• Founded by Amos Bairoch

• 1988 First release in the PC/Gene software

• 1990 Synchronisation with Swiss-Prot

• 1994 Integration of “profiles”

• 1999 PROSITE joins InterPro

• Release 20.57, of 23-Nov-2009

Page 46:

• Contains biological annotation in addition to sequences.

– catalytic, metal binding, S-S bridge, cofactor binding, prosthetic group, PTM

Page 47:

PROSITE Format (Pattern) and Regular Expression Language (REGEXP)

• Pattern: <A-x-[ST](2)-x(0,1)-{V}

• Regexp: ^A.[ST]{2}.?[^V]

• Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine, two times, followed by any amino acid or nothing, followed by any amino acid except a valine.
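The translation from PROSITE pattern to regular expression is mechanical enough to sketch in code. The function below handles only the elements used on these slides (x, [..], {..}, repeat counts, and the < and > anchors); its name is an assumption, and it is not a complete implementation of the PROSITE syntax. It emits .{0,1} where the slide writes .?, which is equivalent.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        if element.startswith("<"):      # anchor at the start of the sequence
            regex.append("^")
            element = element[1:]
        anchor_end = element.endswith(">")
        if anchor_end:                   # anchor at the end of the sequence
            element = element[:-1]
        # Split off an optional repeat count such as (2) or (0,1).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # x = any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {V} = anything except V
        regex.append(core + ("{" + repeat + "}" if repeat else ""))
        if anchor_end:
            regex.append("$")
    return "".join(regex)

converted = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(converted)
print(bool(re.search(converted, "AGSTLA")))  # starts with A, fits the pattern
print(bool(re.search(converted, "AGSSV")))   # ends in V, so it does not fit
```

The same function turns the N-glycosylation pattern N-{P}-[ST]-{P} into N[^P][ST][^P].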