Incremental Inference of Relational Motifs with a Degenerate Alphabet
Nadia Pisanti, LIPN Paris 13 & ABI Paris 6
joint work with:
H. Soldano, LIPN Paris 13 & ABI Paris 6
M. Carpentier, ABI Paris 6
CPM 2005
Summary
Relational Motifs:
• The model.
• A few motivations.
Previous work:
• KMR: the idea of the paradigm.
• KMRC: using a degenerate alphabet & maximality.
• KMRELAT = KMRC + “relations”.
• Problems.
The algorithm:
• The idea.
• Properties that guarantee correctness and efficiency.
• Preliminary tests on 3D Proteins.
Relational Motifs: the idea
“Normal” motifs:
IN: Sequence. Alphabet: {A,C,G,T}. Length: 9. Quorum: 5.
OUT: Motif TCAGTCTCA and its occurrences.
[Figure: the occurrences of TCAGTCTCA highlighted in the sequence.]
Relational Motifs: the idea
Relational motifs (trivial case: only relations):
IN: Sequence and pairwise relations. Alphabet: natural numbers. Length: 3. Quorum: 4. Relations alphabet: {<,>,=}.
OUT: Relations: (p1,p2) <, (p2,p3) >, and (p1,p3) =
[Figure: occurrences in the sequence, e.g. 2 5 2, 1 6 1 and 1 5 1.]
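The trivial relational case above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names and the input sequence are hypothetical, chosen so that the windows 2 5 2, 1 6 1 and 1 5 1 from the slide appear in it.

```python
from itertools import combinations

def relation(a, b):
    """Return '<', '>' or '=' for a pair of numbers."""
    return '<' if a < b else '>' if a > b else '='

def relational_signature(window):
    """Relations between every pair of positions (i < j) of the window."""
    return tuple(relation(window[i], window[j])
                 for i, j in combinations(range(len(window)), 2))

def occurrences(seq, pattern, length):
    """0-based start positions whose window shows exactly these relations."""
    return [i for i in range(len(seq) - length + 1)
            if relational_signature(seq[i:i + length]) == pattern]

# Hypothetical input sequence (an assumption, not taken from the slides).
seq = [2, 5, 2, 9, 1, 6, 1, 6, 1, 5, 1]
# Pair order produced by combinations: (p1,p2), (p1,p3), (p2,p3).
pattern = ('<', '=', '>')
hits = occurrences(seq, pattern, 3)
print(hits, "quorum 4 reached:", len(hits) >= 4)
```

Only the relations are compared, never the numbers themselves, which is why 2 5 2 and 1 6 1 count as occurrences of the same motif.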
Some possible applications
• Music: detecting scales and tunes in different keys.
• 3D structures of proteins: amino acids in 3D space and pairwise distance as relation.
• Motif inference in structured texts: data structures, source codes.
• Numbers and arithmetical relations.
• Events and temporal relations.
KMR (1972): concatenation of short motifs into long ones
Ex: Length k with quorum 3
Length k/2: three motifs (drawn as boxes in the original figure; called B, R and Y here after the names of their positions):
• B occurs at b1,b2,b3,...
• R occurs at r1,r2,r3,...
• Y occurs at y1,y2,y3,...
If R occurs at r8, at distance k/2 = |Y| = |R| from y7, then the concatenation YR occurs at y7 and |YR| = k.
Length k: concatenating all such pairs of (k/2)-motifs gives:
• a k-motif that occurs at y2,y7,...
• a k-motif that occurs at y1,...
• a k-motif that occurs at b1,b8 only: no quorum, so it is discarded.
OUTPUT: the k-motifs that reach the quorum, each with its extent, i.e. the set of positions where it occurs (e.g. y2,y7,y5).
With O(log k) steps all k-motifs are generated.
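The doubling idea can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: k is a power of two, motifs are exact, and classes are numbered with a dictionary rather than with the sorting used in the original KMR algorithm. The names `kmr_double` and `kmr_motifs` and the test sequence are hypothetical.

```python
def kmr_double(labels, l):
    """From class labels of l-long factors, build labels of 2l-long factors.

    labels[i] identifies the l-long word starting at i; two positions get the
    same new label iff their pairs (labels[i], labels[i + l]) coincide.
    """
    pairs = [(labels[i], labels[i + l]) for i in range(len(labels) - l)]
    ids = {}
    return [ids.setdefault(p, len(ids)) for p in pairs]

def kmr_motifs(seq, k, quorum):
    """Exact k-motifs (k a power of two) with their extents, by doubling."""
    ids = {}
    labels = [ids.setdefault(c, len(ids)) for c in seq]  # classes for l = 1
    l = 1
    while l < k:                      # O(log k) doubling steps
        labels = kmr_double(labels, l)
        l *= 2
    extents = {}
    for pos, lab in enumerate(labels):
        extents.setdefault(lab, []).append(pos)
    # Keep only the classes that reach the quorum; report the word itself.
    return {seq[e[0]:e[0] + k]: e for e in extents.values() if len(e) >= quorum}

result = kmr_motifs("ACGTACGTTACGT", 4, 3)
print(result)
```

Positions are 0-based here. The word TACG occurs only twice in this sequence, so with quorum 3 it is dropped, just like the motif at b1,b8 in the slide.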
KMRC: Degenerate Alphabet
Finding exact motifs is not enough for certain applications... (nor so challenging!)
IN: Input sequence (hence implicitly its alphabet {A,C,G,T}) & motif’s alphabet (a cover of the sequence alphabet): {A},{C,G},{C,T}. Length: 5. Quorum: 5.
OUT: Motif {C,T}{C,G}A{C,T}A and its occurrences.
[Figure: occurrences in the sequence, e.g. CCATA, TCACA, TGATA.]
Alphabet with degeneracy 2 (a letter belongs to at most 2 sets: here C is in both {C,G} and {C,T}).
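As a small illustration of what "occurring" means with a degenerate alphabet, the sketch below checks the motif {C,T}{C,G}A{C,T}A against a sequence position by position. The sequence is hypothetical (assembled from words shown on the slide) and the helper names are assumptions.

```python
def occurs_at(seq, motif, pos):
    """True if every letter of the window belongs to the set at that position."""
    return all(seq[pos + j] in s for j, s in enumerate(motif))

def extent(seq, motif):
    """All 0-based positions where the degenerate motif occurs."""
    k = len(motif)
    return [p for p in range(len(seq) - k + 1) if occurs_at(seq, motif, p)]

CT, CG, A = frozenset("CT"), frozenset("CG"), frozenset("A")
motif = [CT, CG, A, CT, A]

# Hypothetical sequence containing CCATA, TCACA, CGACA and TGATA.
seq = "CCATAGTCACACGACAGTGATA"
positions = extent(seq, motif)
print([(p, seq[p:p + 5]) for p in positions])
```

Four different words are reported as occurrences of the same motif, which an exact-matching approach such as plain KMR would treat as four unrelated motifs.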
Repeated structures in 3D proteins
Relational motifs with degenerate alphabet (one for symbols and one for relations)
• Sequence alphabet: amino acids of primary structure.
• Motifs degenerate alphabet: grouping amino acids with similar chemical properties.
• Relations: distance of α-carbons in 3D tertiary structure of the protein.
• Relations degenerate alphabet: e.g. discretizing the distance.
[Figure: positions 1 to 9 in space with their pairwise distances. The distances among positions 1-4 and among positions 6-9 are:]
r(1,2) = 3   r(1,3) = 4   r(1,4) = 5   r(2,3) = 4   r(2,4) = 3   r(3,4) = 3
r(6,7) = 4   r(6,8) = 3   r(6,9) = 6   r(7,8) = 4   r(7,9) = 4   r(8,9) = 2
Motifs & Extents
A k-motif is a k-long word on {C1,C2,C3} (occurring at least q times).
Different motifs can occur at the same position...
Even worse: two different motifs may have the very same extent.
E.g. motif’s alphabet C1={A,C}, C2={A,G}, C3={A,T} with degeneracy g=3.
In AAAAAAAAAAAAAAAAAAAAA ... AAAAAAAAAAAAAAAAAAAAAAAAAAAA = A^n:
• Every k-long word on {C1,C2,C3} is a different k-motif!
• Each one of them has extent {1,2,3,…,n-k+1} (indeed, it’s always the same…)
Hence {1,2,3,…,n-k+1} represents g^k motifs!
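The A^n example is easy to check by brute force. The sketch below (hypothetical names, small n and k chosen for illustration) enumerates all words of length k on {C1,C2,C3} and collects their extents, using 1-based positions as on the slide.

```python
from itertools import product

C1, C2, C3 = frozenset("AC"), frozenset("AG"), frozenset("AT")

def extent(seq, motif):
    """1-based positions where the degenerate motif occurs."""
    k = len(motif)
    return tuple(p + 1 for p in range(len(seq) - k + 1)
                 if all(seq[p + j] in s for j, s in enumerate(motif)))

n, k = 10, 3                      # small values, for illustration only
seq = "A" * n
extents = {m: extent(seq, m) for m in product((C1, C2, C3), repeat=k)}
distinct = set(extents.values())
print(len(extents), "motifs share", len(distinct), "extent")
```

With g = 3 sets per position there are 3^3 = 27 motifs here, all sharing the single extent {1,...,8}; this is why the algorithms work on extents rather than on motif words.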
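Maximal motifs
E.g. motif’s alphabet: C1={A}, C2={C,G}, C3={C,T}
[Figure: a sequence containing CCACA, CGATA, ... with the occurrence positions p1,...,p7 marked.]
• Motif C2 C2 C1 C2 C1 occurs at p3,p4,p5,p6,p7: non maximal.
• Motif C2 C2 C1 C3 C1 occurs at p3,p4,p5,p6,p7 and also at p1,p2: maximal.
The first motif is non maximal because its extent is strictly included in that of the second one.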
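Maximality as extent inclusion can be sketched as follows. The sequence is hypothetical, only the two motifs of the slide are compared, and motifs with identical extents are not merged in this simplified version.

```python
ALPHABET = {"C1": frozenset("A"), "C2": frozenset("CG"), "C3": frozenset("CT")}

def extent(seq, names):
    """0-based positions where the motif (a list of set names) occurs."""
    sets = [ALPHABET[n] for n in names]
    return frozenset(p for p in range(len(seq) - len(sets) + 1)
                     if all(seq[p + j] in s for j, s in enumerate(sets)))

def non_maximal(extents):
    """Motifs whose extent is strictly included in another motif's extent."""
    return {m for m, e in extents.items()
            if any(e < f for f in extents.values())}   # '<' is proper subset

seq = "CCACAGCGATA"               # hypothetical sequence: CCACA ... CGATA
extents = {
    "C2 C2 C1 C2 C1": extent(seq, ["C2", "C2", "C1", "C2", "C1"]),
    "C2 C2 C1 C3 C1": extent(seq, ["C2", "C2", "C1", "C3", "C1"]),
}
discarded = non_maximal(extents)
print(discarded)
```

Both motifs match CCACA, but only the second also matches CGATA (T is in C3 and not in C2), so the first one carries no extra information and is discarded.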
Maximal motifs: good and bad news
Good news:
• Each maximal k-motif can be built from two maximal (k/2)-motifs.
Bad news:
• Two maximal (k/2)-motifs can generate a non-maximal motif.
• Non-maximal motifs have to be detected and discarded at each step.
Very bad news:
• There can be an exponential number of maximal motifs (theoretically).
KMRelat (2003): introducing relations
Ex: Length k with quorum 3
Length k/2:
• one motif occurs at r1,r2,r3,... and relations are conserved
• one motif occurs at y1,y2,y3,... and relations are conserved
Length k:
• the concatenated motif occurs at y1,y2,y3,... and SOME relations are conserved
There are still O(k^2) relations to be checked... for each occurrence... and at each step...
Why KMR+overlap
It takes O(k) steps, each one taking O(n), hence O(kn) [regardless of the degenerate alphabet].
Possible alternatives:
• KMR would take O(log k) steps, with step i concatenating two 2^i-motifs and checking (2^i)^2 relations, that is $\sum_{i=1}^{(\log_2 k)-1} n \cdot 2^{2i} = \dots = O(k^2 n)$.
• With an in-depth approach (not KMR-like) it would take O(k) steps, where at step i an i-motif is extended by one position and i relations are checked, that is $\sum_{i=1}^{k-1} n \cdot i = O(k^2 n)$.
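For reference, the two sums can be worked out explicitly. This is only the standard closed form of a geometric and an arithmetic series, written with the same n and k as above:

```latex
\[
\sum_{i=1}^{(\log_2 k)-1} n \cdot 2^{2i}
  = n \sum_{i=1}^{(\log_2 k)-1} 4^{i}
  = n \cdot \frac{4^{\log_2 k} - 4}{3}
  = n \cdot \frac{k^{2} - 4}{3}
  = O(k^{2} n)
\]
\[
\sum_{i=1}^{k-1} n \cdot i
  = n \cdot \frac{k(k-1)}{2}
  = O(k^{2} n)
\]
```

Both alternatives are therefore quadratic in k, against the O(kn) of the overlap approach.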
KMRoverlap and relations
Inferring relational k-motifs with degenerate alphabets performing overlap steps:
• Maximal motifs still suffice.
• No need to explicitly store relations: the extents still suffice.
• Relations refine the query and thus reduce the search space and the output size.
• More sensitive motif inference.
KMRoverlap: sketch of the algorithm
l := 1;
REPEAT
    Overlap two relational l-motifs;
    Check relations and generate as many relational (l+d)-motifs as conserved ones;
    Check quorum;
    Eliminate non maximal;
    l := l+d;
UNTIL l = k.
... but there is a problem...
Pseudomotifs
An example: motif’s alphabet C1={a,b}, C2={b,c}, and C3={x}. Input sequence = xbxcxaxbxc (positions 1 to 10). Quorum q=2 and length k=3.
The extent {1,7} corresponds to xbx occurring twice (so far so good...) and corresponding to the motif C3 (C1∩C2) C3 (strange...)
• C3 C1 C3 has extent {1,5,7} ⊃ {1,7}
• C3 C2 C3 has extent {1,3,7} ⊃ {1,7}
C3 (C1∩C2) C3 is a pseudomotif... and {1,7} a pseudoextent.
The extent {2,8} corresponds to (C1∩C2) C3 C2, but it is also the extent of the 3-motif C2 C3 C2: {2,8} is not a pseudoextent.
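The example can be verified by brute force. The sketch below (hypothetical helper names, 1-based positions as on the slide) computes the extents of all 27 true 3-motifs and checks whether {1,7} and {2,8} are among them.

```python
from itertools import product

ALPHABET = {"C1": frozenset("ab"), "C2": frozenset("bc"), "C3": frozenset("x")}
SEQ = "xbxcxaxbxc"

def extent(seq, sets):
    """1-based positions where the sequence of letter sets matches."""
    k = len(sets)
    return frozenset(p + 1 for p in range(len(seq) - k + 1)
                     if all(seq[p + j] in s for j, s in enumerate(sets)))

def motif_extent(names):
    """Extent of a true motif, i.e. a word on {C1, C2, C3}."""
    return extent(SEQ, [ALPHABET[n] for n in names])

# Extents of all true 3-motifs.
true_extents = {motif_extent(names) for names in product(ALPHABET, repeat=3)}

# C1 ∩ C2 = {b} is not a symbol of the motif's alphabet.
b_only = ALPHABET["C1"] & ALPHABET["C2"]
pseudo = extent(SEQ, [ALPHABET["C3"], b_only, ALPHABET["C3"]])
other = extent(SEQ, [b_only, ALPHABET["C3"], ALPHABET["C2"]])

print(sorted(pseudo), "is a pseudoextent:", pseudo not in true_extents)
print(sorted(other), "is a pseudoextent:", other not in true_extents)
```

The set {1,7} is the intersection of two true extents but is the extent of no word on {C1,C2,C3}, whereas {2,8} is also reached by the true motif C2 C3 C2.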
Pseudomotifs
Why are pseudomotifs dangerous?
• The overlap of two maximal (k-d)-motifs can generate a pseudomotif.
• There can be O(2^(|G|k)) distinct pseudomotifs of length k.
Pseudomotifs can never be maximal. Thus:
• They will never have to be output.
• They will never be useful to generate longer motifs.
We need to find a way to avoid generating them...
Storing prefixes and suffixes
[Figure: a motif of length k with its prefix and its suffix of length k-d.]
For each motif of length k, its prefix and suffix of length k-d are stored, also those of inherited motifs.
When the extent of a motif is included in that of another motif, the former is eliminated and the latter inherits its prefixes and suffixes.
Avoiding pseudomotifs
[Figure: two motifs of length k-d overlapped at distance d; each one has its prefixes in a set P and its suffixes in a set S.]
The motif of length k is generated iff S ∩ P ≠ ∅, where S is the set of suffixes of the first motif and P the set of prefixes of the second one.
Some interesting properties
• The prefix-suffix condition avoids generating exactly the pseudomotifs!
• It is enough that ONE maximal motif inherits the prefix and suffix of a discarded one.
[Figure: of the motifs whose extent includes that of the discarded one, only one inherits from it.]
The KMRoverlap algorithm
l := 1;
while l < k do
    for each l-motif occurring at x and l-motif occurring at x+d do:
        if S ∩ P ≠ ∅ then generate an (l+d)-motif for each different set of conserved relations;
    Eliminate extents of size < q;
    Eliminate nonmaximal extents;
    l := l + d;
endwhile;
Output all remaining motifs.
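The overlap loop can be sketched in Python under strong simplifying assumptions: d = 1, no relations, and no elimination of non-maximal extents. Because every motif is kept as an explicit word on the motif's alphabet, the prefix-suffix test reduces here to plain word equality and no inheritance is needed. All names are hypothetical, and this is not the authors' implementation.

```python
def extend_by_overlap(motifs):
    """One overlap step with d = 1: l-motifs -> (l+1)-motifs.

    motifs maps a tuple of set names to its extent (0-based positions).
    u at x and v at x+1 are overlapped only if the suffix of u equals the
    prefix of v, so that the result is again a word on the motif's alphabet.
    """
    longer = {}
    for u, ext_u in motifs.items():
        for v, ext_v in motifs.items():
            if u[1:] != v[:-1]:          # simplified prefix-suffix condition
                continue
            ext = {x for x in ext_u if x + 1 in ext_v}
            if ext:
                longer[u + v[-1:]] = ext
    return longer

def overlap_motifs(seq, alphabet, k, q):
    """All k-motifs reaching quorum q, built by k-1 overlap steps."""
    motifs = {(name,): {i for i, c in enumerate(seq) if c in s}
              for name, s in alphabet.items()}
    motifs = {m: e for m, e in motifs.items() if len(e) >= q}
    l = 1
    while l < k:
        motifs = extend_by_overlap(motifs)
        motifs = {m: e for m, e in motifs.items() if len(e) >= q}  # quorum
        l += 1
    return motifs

alphabet = {"C1": frozenset("ab"), "C2": frozenset("bc"), "C3": frozenset("x")}
result = overlap_motifs("xbxcxaxbxc", alphabet, 3, 2)
for motif, ext in sorted(result.items()):
    print(" ".join(motif), sorted(ext))
```

On the pseudomotif example this yields four 3-motifs, and the pseudoextent {1,7} (that is {0,6} in 0-based positions) never appears. The price of keeping explicit words is that the number of motifs is not reduced to the maximal ones, which is what the full algorithm adds.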
KMRoverlap
Ex: Length k with quorum 2
Length k-d: [Figure: the (k-d)-motifs and their occurrences.]
Overlap two (k-d)-motifs: check relations and generate as many k-motifs as relations sets.
Length k: [Figure: the generated k-motifs; one of them occurs once, the others occur twice.]
Check quorum: the motif that occurs once is discarded.
The remaining motifs are all maximal (just a coincidence to simplify!).
Complexity
• O(k) steps.
• At each step there are O(n g^l) motifs of length l.
• Generating new motifs takes O(n g^l).
• Detecting possible inclusions takes O(n g^(2l)).
Overall complexity in O(k n g^(2k)) [linear w.r.t. input size but still looks bad], but it is a very rough approximation...
To be precise...
• There are two degeneracies: g and g’.
• At each step there are O(n g^l) motifs of length l.
• Generating new motifs takes O(n g^l + (g’)^(2l)).
• Detecting possible inclusions takes O(n (g + (g’)^(2l))^2).
Overall complexity in O(k n (g + (g’)^(2k))^2) [linear w.r.t. input size but exponential in k], but it is an even rougher approximation...
KMRoverlap: correctness and completeness
The algorithm is correct (it generates ONLY maximal k-motifs) because:
• Non-maximal motifs are discarded.
• It stops when k-motifs are generated.
The algorithm is complete (it generates ALL maximal k-motifs) because:
• Overlapping two maximal (k-d)-motifs is enough to generate all maximal k-motifs.
• The prefix-suffix condition only discards pseudomotifs.
Preliminary tests
k = 8, d = 1, q = 5, n ~ 10^3

overlap step   generated motifs   avoided pseudo-motifs   maximal motifs
3 → 4                      8973                       0              124
4 → 5                     81757                   76953             1110
5 → 6                    550911                 1881241              936
6 → 7                    165727                 1673186              167
7 → 8                     12502                   15668               27
• As expected there are many pseudo-motifs.
• Although they have the same theoretical upper bound, the number of maximal k-motifs is considerably smaller than that of the k-motifs.