Mining for Tree-Query Associations in a Graph
Jan Van den Bussche Hasselt University, Belgium
joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium)
2
Graph Data
A (directed) graph over a set of nodes N is a set G of edges: ordered pairs (i, j) with i, j ∈ N.
Snapshot of a graph representing the metabolic pathway of a human.
Applications: life sciences, biology, social sciences, WWW, ...
3
Graph Mining
Transactional category
– dataset: set of many small graphs (transactions)
– frequency: number of transactions in which the pattern occurs (at least once)
– ILP: Warmr
[AGM, FSG, TreeMiner, gSpan, FFSM, Horvath-Ramon-Wrobel]
Single graph category
– dataset: single large graph
– frequency: number of copies of the pattern in the large graph
[Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]
Focus so far is on pattern mining; little work on association rule mining!
4
Tree-Query Pattern
• powerful tree-shaped pattern
• inspired by conjunctive database queries
• special features:
– existential nodes
– parameterized nodes
• occurrence of the pattern in G is any homomorphism from the pattern into G
[Figure: tree pattern with root 0 and an existential child z, whose children are 8 and x]
frequency: #{ x | ∃z: (0, z) ∈ G ∧ (z, 8) ∈ G ∧ (z, x) ∈ G }
5
Association rules
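• fully fledged associations over tree-query patterns
• example:
[Figure: association rule ⇒ between two tree queries; one has head (x1, x2, x3), the other contains the constants 0 and 8 and the variable x]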
6
Experimental results: Real-life datasets
• Food web nodes1 edges30
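[Figure: tree pattern over x1, x2 and the constant 20; frequency = 176]
[Figure: rule (pattern over x1, x2) ⇒ (pattern over x1, x2 and the constant 20); confidence = 89%]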
8
Experimental results: Food web
nodes1 edges30
[Figure: association rules over tree queries with heads (x1, x2, x3, x4, x5) and (0, x2, x3, x4, x5); confidences 45% and 55%]
9
Experimental results: Real-life datasets
• Protein interactions graph nodes211 edges80
[Figure: rule (pattern over x1, x2) ⇒ (pattern over x1, x2 and the constant 22); confidence = 10%]
10
Experimental results: Protein interaction graph
nodes211 edges80
[Figure: rule between two tree queries with head (x1, x2), involving the constants 746 and 376; confidence 90%]
11
Outline rest of the talk
• Formal problem definition
• Algorithm
– overall approach
– levelwise generation of tree patterns
– generation of containment mappings
– generation of parameter assignments
• Equivalent association rules
• Certhia
• Performance and experimental results
• Future work
12
Tree pattern
[Figure: tree pattern with root 0 and a child whose children are 8 and x]
select distinct G3.to as x
from G G1, G G2, G G3
where G1.from = 0 and G1.to = G2.from
  and G1.to = G3.from and G2.to = 8
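The query above can be tried out on a small edge table. The following is a minimal sketch using Python's built-in sqlite3 module; the edge list EDGES is a hypothetical example (the edges of the graph in the slides are not given in this transcript), and the root constant 0 is taken from the pattern figure. The column names "from" and "to" are SQL keywords, so the sketch quotes them.

```python
import sqlite3

# Hypothetical edge set, chosen for illustration only.
EDGES = [(0, 1), (0, 2), (1, 4), (1, 8), (2, 4), (2, 5), (2, 8)]


def pattern_answers(edges):
    """Evaluate the tree pattern (root 0, leaves 8 and x) over an edge table G."""
    con = sqlite3.connect(":memory:")
    # "from" and "to" are reserved words, hence the quoted identifiers.
    con.execute('create table G ("from" integer, "to" integer)')
    con.executemany("insert into G values (?, ?)", edges)
    rows = con.execute(
        '''
        select distinct G3."to" as x
        from G G1, G G2, G G3
        where G1."from" = 0 and G1."to" = G2."from"
          and G1."to" = G3."from" and G2."to" = 8
        order by x
        '''
    ).fetchall()
    con.close()
    return [x for (x,) in rows]


answers = pattern_answers(EDGES)
print(answers, "frequency =", len(answers))
```

The number of rows returned is the frequency of the pattern on this particular edge table.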
17
Matching
[Figure: pattern P (root 0, child z, with z's children 8 and x) and data graph G over the nodes 0 to 8]
Matchings h of P in G (each row lists the images of the pattern nodes 0, z, 8, x):
h 0 1 8 4
h 0 1 8 8
h 0 2 8 4
h 0 2 8 5
h 0 2 8 8
24
Frequency
frequency = 3 (over all matchings, x takes three distinct values: 4, 5 and 8)
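The matching step can be sketched as a brute-force search for homomorphisms. The edge set G below is an assumption: it is one small graph that is consistent with the five matchings listed above, not the graph drawn in the slides. Pattern variables are written as strings ("z", "x") and constants as integers.

```python
from itertools import product

# Hypothetical data graph, chosen to be consistent with the listed matchings.
G = {(0, 1), (0, 2), (1, 4), (1, 8), (2, 4), (2, 5), (2, 8)}
NODES = {n for edge in G for n in edge}

# Pattern P: 0 -> z, z -> 8, z -> x.
PATTERN_NODES = [0, "z", 8, "x"]
PATTERN_EDGES = [(0, "z"), ("z", 8), ("z", "x")]


def matchings(pattern_nodes, pattern_edges, graph, graph_nodes):
    """Yield every homomorphism h from the pattern into the graph."""
    variables = [n for n in pattern_nodes if isinstance(n, str)]
    for values in product(sorted(graph_nodes), repeat=len(variables)):
        h = dict(zip(variables, values))
        # Constants are mapped to themselves.
        h.update({n: n for n in pattern_nodes if not isinstance(n, str)})
        # A homomorphism must send every pattern edge to a graph edge.
        if all((h[a], h[b]) in graph for a, b in pattern_edges):
            yield h


found = [
    tuple(h[n] for n in PATTERN_NODES)
    for h in matchings(PATTERN_NODES, PATTERN_EDGES, G, NODES)
]
# Frequency counts distinct images of x, not the number of matchings.
frequency = len({row[3] for row in found})
```

Exhaustive enumeration is only meant to make the definition concrete; the slides compute frequencies with SQL instead.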
25
Tree Query
• Q = (H, P)
– H, head: (x, x, 8)
– P, body: [Figure: the tree pattern with nodes 0, 8 and x]
26
Association Rule
• AR: Q1 ⇒ Q2
Confidence(AR) = freq(Q2)/freq(Q1), where Q2 ⊆ Q1
[Figure: Q1 with head (x1, x2, x3) ⇒ Q2 with head (x, x, 6)]
{ (x1, x2, x3) | Q1(x1, x2, x3) holds in G } ⊇ { (x, x, 6) | Q2(x, x, 6) holds in G }
27
Examples of Association Rules
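(1) [Figure: (pattern over x1, x2; head (x1, x2)) ⇒ (pattern over x1 and the constant 8; head (x1, 8))]
(2) [Figure: (pattern over x1; head (x1)) ⇒ (pattern over x1 and the constant 8; head (x1))]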
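To make the confidence of such a rule concrete, here is a small sketch for a rule shaped like example (1). It assumes that both patterns are single edges (x1 → x2 on the left, x1 → 8 on the right), which the transcript does not state explicitly, and it uses a hypothetical edge set G.

```python
from fractions import Fraction

# Hypothetical data graph, for illustration only.
G = {(0, 1), (0, 2), (1, 4), (1, 8), (2, 4), (2, 5), (2, 8)}


def answers_q1(graph):
    """Q1: a single edge x1 -> x2 with head (x1, x2)."""
    return set(graph)


def answers_q2(graph, constant=8):
    """Q2: a single edge x1 -> 8 with head (x1, 8)."""
    return {(a, b) for (a, b) in graph if b == constant}


def confidence(graph):
    """freq(Q2) / freq(Q1), where every answer of Q2 is also an answer of Q1."""
    q1, q2 = answers_q1(graph), answers_q2(graph)
    assert q2 <= q1  # containment makes the ratio a value between 0 and 1
    return Fraction(len(q2), len(q1))


rule_confidence = confidence(G)
```

On this graph two of the seven edges end in node 8, so the rule would hold with confidence 2/7.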
29
Containment Mapping
[Figure: queries Q1, with head (x1, x2, x3), and Q2, with head (x, x, 6); arrows show a containment mapping from the nodes of Q1 to the nodes of Q2]
Q2 ⊆ Q1 ⇔ there is a containment mapping from Q1 to Q2
34
Problem statement: Mining tree queries
Given a graph G and a threshold k, find all tree queries that
have frequency at least k in G; those queries are called frequent.
35
Problem statement: Association rules
• Input:
– a graph G
– minsup
– Qleft, frequent in G
– minconf
• Output: all association rules Qleft ⇒ Q that are
– frequent in G
– confident in G
36
Algorithm: mining tree queries
Outer loop: Generate, incrementally, all possible trees of increasing sizes. Avoid generation of isomorphic trees.
Inner loop: For each newly generated tree, generate all queries based on that tree, and test their frequency.
[Figure: trees of increasing size over the nodes x1, x2, x3, x4]
37
Outer loop
• It is well known how to efficiently generate all trees uniquely up to isomorphism
• Based on canonical form of trees.
• [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]
38
Inner loop: Levelwise approach
• A query Q is characterized by
– Π_Q: set of existential nodes
– Σ_Q: set of selected nodes
– λ_Q: labeling of the selected nodes by constants
• Q1 = (Π1, Σ1, λ1) specializes Q2 = (Π2, Σ2, λ2) if Π1 ⊇ Π2, Σ1 ⊇ Σ2 and λ1 agrees with λ2 on Σ2
• If Q1 specializes Q2 then freq(Q1) ≤ freq(Q2)
• Most general query: T = (∅, ∅, ∅)
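The specialization test can be sketched in a few lines. The sketch assumes that a query over a fixed tree is represented by its set of existential nodes, its set of selected nodes, and a labeling of the selected nodes by constants; the class name Query and its field names are hypothetical.

```python
from typing import NamedTuple


class Query(NamedTuple):
    existential: frozenset  # existential nodes
    selected: frozenset     # selected nodes
    labeling: dict          # selected node -> constant


def specializes(q1, q2):
    """True if q1 has at least the existential and selected nodes of q2
    and labels the selected nodes of q2 in the same way."""
    return (
        q1.existential >= q2.existential
        and q1.selected >= q2.selected
        and all(q1.labeling[n] == q2.labeling[n] for n in q2.selected)
    )


most_general = Query(frozenset(), frozenset(), {})
q = Query(frozenset({"x2"}), frozenset({"x1", "x3"}), {"x1": 0, "x3": 8})
q_other = Query(frozenset({"x2"}), frozenset({"x1"}), {"x1": 5})
```

Here q specializes most_general, but it does not specialize q_other, because the two disagree on the constant for x1.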
39
Inner loop: Candidate generation
• CanTab_{Π,Σ} = { λ | (Π, Σ, λ) is a candidate query }; FreqTab_{Π,Σ} = { λ | (Π, Σ, λ) is a frequent query }
• Q' = (Π', Σ') is a parent of Q = (Π, Σ) if either:
– Σ' = Σ and Π has precisely one more node than Π', or
– Π' = Π and Σ has precisely one more node than Σ'
• Join Lemma: Each candidacy table can be computed by taking the natural join of its parent frequency tables.
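As an illustration of the Join Lemma, the following sketch computes a candidate table as the natural join of its parent frequency tables. Tables are modelled as lists of dict rows keyed by selected node; the helper names and the table contents are hypothetical, and the frequency column is left out for brevity.

```python
from functools import reduce


def natural_join(left, right):
    """Natural join of two tables given as lists of dict rows."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            # Rows combine only if they agree on every shared column.
            if all(l_row[c] == r_row[c] for c in shared):
                row = {**l_row, **r_row}
                if row not in joined:
                    joined.append(row)
    return joined


def candidate_table(parent_freq_tables):
    """Join all parent frequency tables into one candidate table."""
    return reduce(natural_join, parent_freq_tables)


# Hypothetical parent tables for existential node x2 and selected nodes x1, x3.
freq_x2_x1 = [{"x1": 0}, {"x1": 3}]
freq_x2_x3 = [{"x3": 8}, {"x3": 4}]
freq_none_x1x3 = [{"x1": 0, "x3": 8}, {"x1": 0, "x3": 4}, {"x1": 3, "x3": 5}]

can_tab = candidate_table([freq_x2_x1, freq_x2_x3, freq_none_x1x3])
```

A labeling survives the join only if each of its restrictions appears in the corresponding parent table, which is the usual levelwise pruning idea.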
40
Inner loop: Frequency counting
• Each candidacy table can be computed by a single SQL query (cf. Join Lemma).
• Suppose G(from, to) is a table in the database; then each frequency table can be computed with a single SQL query:
– no selected nodes: formulate the query in SQL and count
– otherwise: formulate the query in SQL as an expression E; take the natural join of E with CanTab; group by the selected nodes; count each group
41
Inner loop: Example
[Figure: query Q (the tree pattern with 0, 8 and x) and its underlying tree T: x1 → x2, x2 → x3, x2 → x4]
Π = {x2}, Σ = {x1, x3}, λ: x1 ↦ 0, x3 ↦ 8
• Join expression:
CanTab_{{x2},{x1,x3}} = FreqTab_{{x2},{x1}} ⋈ FreqTab_{{x2},{x3}} ⋈ FreqTab_{∅,{x1,x3}}
47
Inner loop: Example
• SQL expression E, with x2 existential:
select distinct G1.from as x1, G2.to as x3, G3.to as x4
from G G1, G G2, G G3
where G1.to = G2.from and G3.from = G2.from
48
Inner loop: Example
• SQL expression for filling the frequency table:
select distinct E.x1, E.x3, count(E.x4)
from E, CanTab{x2}{x1,x3} as CT
where E.x1 = CT.x1 and E.x3 = CT.x3
group by E.x1, E.x3
having count(E.x4) >= k
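The frequency-counting step can be run end to end with sqlite3. This sketch follows the two SQL expressions from the example, with some assumptions: E is created as a view, the candidate table is simply named CanTab, the edge list and candidate rows are hypothetical, and the redundant "distinct" in the outer query is dropped because "group by" already yields one row per group.

```python
import sqlite3

# Hypothetical edge set and candidate rows (x1, x3), for illustration only.
EDGES = [(0, 1), (0, 2), (1, 4), (1, 8), (2, 4), (2, 5), (2, 8)]
CANDIDATES = [(0, 4), (0, 8), (1, 8)]


def frequency_table(edges, candidates, k):
    """Return rows (x1, x3, freq) for the candidates with frequency >= k."""
    con = sqlite3.connect(":memory:")
    con.execute('create table G ("from" integer, "to" integer)')
    con.executemany("insert into G values (?, ?)", edges)
    con.execute("create table CanTab (x1 integer, x3 integer)")
    con.executemany("insert into CanTab values (?, ?)", candidates)
    # Expression E: all (x1, x3, x4) for the tree x1 -> x2, x2 -> x3, x2 -> x4.
    con.execute(
        '''
        create view E as
        select distinct G1."from" as x1, G2."to" as x3, G3."to" as x4
        from G G1, G G2, G G3
        where G1."to" = G2."from" and G3."from" = G2."from"
        '''
    )
    rows = con.execute(
        '''
        select E.x1, E.x3, count(E.x4) as freq
        from E, CanTab as CT
        where E.x1 = CT.x1 and E.x3 = CT.x3
        group by E.x1, E.x3
        having count(E.x4) >= ?
        order by E.x1, E.x3
        ''',
        (k,),
    ).fetchall()
    con.close()
    return rows


freq_tab = frequency_table(EDGES, CANDIDATES, k=3)
```

With these inputs the candidate (1, 8) has no match in E and disappears from the result, while the other two candidates each reach frequency 3.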
49
Algorithm: Mining association rules
Loop 1: Generate, incrementally, all possible trees T of increasing sizes.
Loop 2: For each T, generate all frequent tree patterns P based on T.
Loop 3: For each P, generate all containment mappings f from Pleft to P.
Loop 4: For each f, generate Q = (f(Hleft), P) and all parameter instantiations for Qleft ⇒ Q.
50
Pattern database
• For each P, a table FreqTab_P that contains all frequent parameter instantiations.
[Figure: the pattern database]
51
Loop 3: Generation of containment mappings
Efficiently solvable, due to tree shape.
52
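Loop 4: Generation of parameter instantiations
single relational algebra expression (SQL):
\[ \pi_{\mathit{plist}} \, \sigma_{\mathit{FreqTab}_{P}.\mathit{freq} \,/\, \mathit{FreqTab}_{P_{left}}.\mathit{freq} \;\geq\; \mathit{minconf}} \bigl( \sigma_{\vartheta_{left}}(\mathit{FreqTab}_{P_{left}}) \bowtie_{\vartheta} \mathit{FreqTab}_{P} \bigr) \]
• \( \vartheta \): \( \bigwedge_{\sigma \in \Sigma_{left}} \mathit{FreqTab}_{P_{left}}.\sigma = \mathit{FreqTab}_{P}.f(\sigma) \)
• \( \vartheta_{left} \): \( \bigwedge_{\sigma \in \Sigma_{left}} \mathit{FreqTab}_{P_{left}}.\sigma = \lambda_{left}(\sigma) \), where \( \lambda_{left}(\sigma) \) is the constant that Qleft assigns to \( \sigma \)
• plist: all \( P.\sigma \) with \( \sigma \in \Sigma \), \( \mathit{FreqTab}_{P}.\mathit{freq} \), and \( \mathit{FreqTab}_{P}.\mathit{freq} / \mathit{FreqTab}_{P_{left}}.\mathit{freq} \)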
53
Example: Loop 4
[Figure: Qleft with head (x2, x, x); pattern P with nodes 1, x2, 4 and 5; resulting query Q with head (x2, x2, 5)]
select freqQleft.x1, freqQleft.x4, freqP.x1, freqP.x4, freqP.x5,
       freqP.freq, freqP.freq/freqQleft.freq
from freqP, freqQleft
where freqQleft.x1 = freqP.x1 and freqQleft.x4 = freqP.x4
  and freqP.freq/freqQleft.freq >= minconf
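The confidence filter of Loop 4 can be sketched with sqlite3 as well. The table contents below are hypothetical. One detail is worth flagging as an assumption about the SQL engine: in SQLite, dividing two integer columns performs integer division, so the sketch multiplies by 1.0 to obtain a fractional confidence; other systems may behave differently.

```python
import sqlite3

# Hypothetical frequency tables, for illustration only.
FREQ_QLEFT = [(1, 4, 10), (2, 4, 8)]                 # rows (x1, x4, freq)
FREQ_P = [(1, 4, 5, 9), (1, 4, 6, 2), (2, 4, 5, 4)]  # rows (x1, x4, x5, freq)


def confident_instantiations(freq_qleft, freq_p, minconf):
    """Return rows (x1, x4, x5, freq, confidence) with confidence >= minconf."""
    con = sqlite3.connect(":memory:")
    con.execute("create table freqQleft (x1 integer, x4 integer, freq integer)")
    con.execute(
        "create table freqP (x1 integer, x4 integer, x5 integer, freq integer)"
    )
    con.executemany("insert into freqQleft values (?, ?, ?)", freq_qleft)
    con.executemany("insert into freqP values (?, ?, ?, ?)", freq_p)
    rows = con.execute(
        '''
        select freqP.x1, freqP.x4, freqP.x5, freqP.freq,
               1.0 * freqP.freq / freqQleft.freq as confidence
        from freqP, freqQleft
        where freqQleft.x1 = freqP.x1 and freqQleft.x4 = freqP.x4
          and 1.0 * freqP.freq / freqQleft.freq >= ?
        order by freqP.x1, freqP.x4, freqP.x5
        ''',
        (minconf,),
    ).fetchall()
    con.close()
    return rows


rules = confident_instantiations(FREQ_QLEFT, FREQ_P, minconf=0.5)
```

Each returned row corresponds to one parameter instantiation of the right-hand query, together with its frequency and the confidence of the resulting rule.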
56
Equivalent queries
Queries Q1 and Q2 are equivalent if they have the same result sets on all
graphs G (up to renaming of the distinguished variables)
• 2 cases of equivalent queries:
1. Q1 has fewer nodes than Q2
2. Q1 and Q2 have the same number of nodes
57
Equivalence theorem
A containment mapping from Q1 to Q2 is a homomorphism h: Q1 → Q2 that
maps distinguished variables of Q1 one-to-one to distinguished
variables of Q2, and maps selected nodes of Q1 to selected
nodes of Q2, preserving labels.
Two queries are equivalent if and only if there are containment mappings between them in both directions.
58
Case 1: Q1 fewer nodes than Q2
Redundancy lemma: Let Q be a tree query without selected nodes. Then Q has a redundancy if and only if it contains a subtree C in the form of
a linear chain of nodes (possibly just a single node), such that
the parent of C has another subtree that is at least as deep as C.
[Figure: queries Q1 and Q2 over the nodes x1, x2, x3, with a redundant subtree marked]
Case 2: Q1 and Q2 same number of nodes
• Q1 and Q2 must be isomorphic.
• Canonical form of queries: refine the canonical ordering of the underlying unlabeled tree, taking into account node labels.
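One common way to obtain a canonical form for labelled trees is to encode each subtree as a string and sort the encodings of the children. The sketch below uses this generic technique; it is not necessarily the canonical ordering used in the work presented here, and the label "*" for unlabelled nodes is an arbitrary choice.

```python
def canonical_form(tree, labels, node):
    """Canonical string of the labelled subtree rooted at node.

    tree maps each parent to its list of children; labels maps a node
    to its constant, and nodes without a label are encoded as "*".
    """
    encoded_children = sorted(
        canonical_form(tree, labels, c) for c in tree.get(node, [])
    )
    return "(" + str(labels.get(node, "*")) + "".join(encoded_children) + ")"


# Two isomorphic labelled trees with different node names and child order.
tree_a = {"a": ["b", "c"], "b": ["d"]}
labels_a = {"a": 0, "d": 8}
tree_b = {"r": ["s", "t"], "t": ["u"]}
labels_b = {"r": 0, "u": 8}

same = canonical_form(tree_a, labels_a, "a") == canonical_form(tree_b, labels_b, "r")
```

Because children are sorted by their encodings, the result does not depend on node names or on the order in which children are listed, so two isomorphic labelled trees receive the same string.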
60
Equivalent Association Rules
(1) [Figure: rule from a query with head (x1, x2, x3, x) ⇒ a query with head (x1, x2, x2, x3)]
(2) [Figure: rule from a query with head (x1, x2, x3, x) ⇒ a query with head (x1, x2, x3, x3)]
61
Equivalence detection for rules
• Many cases can be checked efficiently.
• But the worst case is still as hard as general graph isomorphism checking.
• Fast heuristics for graph isomorphism checking exist, e.g. Nauty
62
Certhia
• Loop 1 + Loop 2: preprocessing step → pattern database
• Loop 3 + Loop 4: interactive browsing tool → Certhia (demo session)
63
Experimental results: Performance
• Fully implemented on top of IBM DB2
• Preliminary performance results:
– adequate performance
– huge number of patterns
– constant overhead per discovered pattern
64
Performance: Association rules
• Loop 3 and Loop 4:
– very fast
– constant overhead per rule
65
Future work
• Serious scientific data mining
• Loosen restriction to trees
66
Publications
• B. Goethals, E. Hoekx, J. Van den Bussche, “Mining tree queries in a graph”, KDD’05, p 61–69.
• E. Hoekx, J. Van den Bussche, “Mining tree-query associations in a graph”, ICDM’06 regular paper.
• “Certhia: Tree-query mining in large graphs”, ICDM’06 software demo.
• http://alpha.uhasselt.be/~vdbuss