Mining Frequent Closed Graphs on Evolving Data Streams
A. Bifet, G. Holmes, B. Pfahringer and R. Gavalda
University of Waikato, Hamilton, New Zealand
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), UPC-BarcelonaTech, Catalonia
San Diego, 24 August 2011
17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2011)
Mining Evolving Graph Data Streams

Problem: given a data stream D of graphs, find the frequent closed graphs.

[Table: three example graph transactions (small molecules over C, S, N, O atoms).]

We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
2 / 48
Non-incremental Frequent Closed Graph Mining

CloseGraph (Xifeng Yan, Jiawei Han): right-most extension based on depth-first search; builds on gSpan (ICDM'02)
MoSS (Christian Borgelt, Michael R. Berthold): breadth-first search; builds on MoFa (ICDM'02)
3 / 48
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
4 / 48
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
5 / 48
Mining Massive Data
Source: IDC’s Digital Universe Study (EMC), June 2011
6 / 48
Data Streams
Data Streams
- The sequence is potentially infinite
- High volume of data
- High speed of arrival
- Once an element from a data stream has been processed, it is discarded or archived

Tools: approximation; randomization and sampling; sketching
7 / 48
Data Streams Approximation Algorithms

1011000111 1010101

Sliding Window: we can maintain simple statistics over sliding windows using O((1/ε) log² N) space, where N is the length of the sliding window and ε is the accuracy parameter.

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002.
8 / 48
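The cited Datar et al. result uses exponential histograms. As an illustration (not the authors' code), here is a minimal sketch of the DGIM-style structure that approximately counts the 1's among the last N stream bits; the class name, the `k` parameter, and the choice of at most k buckets per size are assumptions of this sketch:

```python
from collections import deque

class DGIM:
    """Approximate count of 1s in the last N bits of a stream, using
    exponential histograms (Datar-Gionis-Indyk-Motwani). Buckets have
    sizes that are powers of two; allowing at most k buckets per size
    gives a small relative error in O((1/eps) log^2 N) bits overall."""

    def __init__(self, window_size, k=2):
        self.N = window_size
        self.k = k              # max buckets per size before merging
        self.t = 0              # current timestamp
        self.buckets = deque()  # (timestamp of newest 1, size), newest first

    def add(self, bit):
        self.t += 1
        # expire buckets whose newest 1 has left the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft((self.t, 1))
        # cascade: merge the two oldest buckets of any over-full size
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.k:
                break
            i, j = same[-2], same[-1]      # the two oldest of this size
            newer_ts = self.buckets[i][0]  # keep the newer timestamp
            del self.buckets[j]
            self.buckets[i] = (newer_ts, 2 * size)
            size *= 2

    def count(self):
        # sum all bucket sizes, counting only half of the oldest bucket
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2
```

On an all-ones stream with N = 10 the estimate stays within the usual DGIM error bound of the true count 10.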
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
9 / 48
Pattern Mining
Dataset Example

Document  Patterns
d1        a b c e
d2        c d e
d3        a b c e
d4        a c d e
d5        a b c d e
d6        b c d
10 / 48
Itemset Mining
Document  Patterns
d1        a b c e
d2        c d e
d3        a b c e
d4        a c d e
d5        a b c d e
d6        b c d

Supporting documents    Frequent itemsets
d1,d2,d3,d4,d5,d6       c
d1,d2,d3,d4,d5          e, ce
d1,d3,d4,d5             a, ac, ae, ace
d1,d3,d5,d6             b, bc
d2,d4,d5,d6             d, cd
d1,d3,d5                ab, abc, abe, be, bce, abce
d2,d4,d5                de, cde
11 / 48
Itemset Mining

Document  Patterns
d1        a b c e
d2        c d e
d3        a b c e
d4        a c d e
d5        a b c d e
d6        b c d

Support  Frequent                      Gen     Closed  Max
6        c                             c       c
5        e, ce                         e       ce
4        a, ac, ae, ace                a       ace
4        b, bc                         b       bc
4        d, cd                         d       cd
3        ab, abc, abe, be, bce, abce   ab, be  abce    abce
3        de, cde                       de      cde     cde

12 / 48
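The frequent / closed / maximal distinction on this toy dataset can be checked with a few lines of brute-force Python. This is only a sketch for the six-document example, not how a real miner (Apriori, FP-growth, etc.) would proceed:

```python
from itertools import combinations

# Toy dataset from the slides: six documents over items {a, b, c, d, e}.
docs = {1: "abce", 2: "cde", 3: "abce", 4: "acde", 5: "abcde", 6: "bcd"}

def support(itemset):
    """Number of documents containing every item of the itemset."""
    return sum(1 for d in docs.values() if set(itemset) <= set(d))

min_sup = 3
items = "abcde"
frequent = {frozenset(c): support(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(c) >= min_sup}

# Closed: no proper frequent superset with the same support.
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
# Maximal: no proper frequent superset at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}
```

Running this reproduces the table: seven closed itemsets (c, ce, ace, bc, cd, abce, cde) of which only abce and cde are maximal.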
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
13 / 48
Graph Dataset

[Table: four example graph transactions (small molecules over C, S, N, O atoms), each with weight 1.]

14 / 48
Graph Coresets

Coreset of a set P with respect to some problem: a small subset that approximates the original set P; solving the problem on the coreset provides an approximate solution for the problem on P.

δ-tolerance Closed Graph: a graph g is δ-tolerance closed if none of its proper frequent supergraphs has a weighted support ≥ (1 − δ) · support(g).
- Maximal graph: 1-tolerance closed graph
- Closed graph: 0-tolerance closed graph

15 / 48
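The δ-tolerance definition can be stated directly in code. In this sketch, itemsets (frozensets) stand in for graphs, so "proper frequent supergraph" becomes "proper superset"; the pattern chain and its supports are illustrative assumptions:

```python
def is_delta_closed(g, delta, support):
    """g is delta-tolerance closed if no proper frequent superset
    reaches a support of (1 - delta) * support(g)."""
    g = frozenset(g)
    return all(s < (1 - delta) * support[g]
               for h, s in support.items() if h > g)

# Illustrative chain of frequent patterns (itemsets standing in for graphs).
support = {frozenset("c"): 6, frozenset("ce"): 5,
           frozenset("ace"): 4, frozenset("abce"): 3}
```

With delta = 0 this is the usual closedness test, and with delta = 1 it reduces to maximality, matching the two special cases on the slide.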
Graph Coresets

Relative support of a closed graph: its support minus the relative supports of its closed supergraphs. Equivalently, the sum of a graph's relative support and the relative supports of its closed supergraphs equals its own support.

(s, δ)-coreset for the problem of computing closed graphs: the weighted multiset of frequent δ-tolerance closed graphs with minimum support s, using their relative supports as weights.

16 / 48
Graph Dataset

[Table: four example graph transactions (small molecules over C, S, N, O atoms), each with weight 1.]

17 / 48
Graph Coresets

[Table: example coreset with minimum support 50% and δ = 1: three subgraphs, each with relative support 3 and support 3.]
18 / 48
Graph Coresets
Figure: Number of graphs in a (40%, δ)-coreset for NCI.
19 / 48
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
20 / 48
IncGraphMiner

IncGraphMiner(D, min_sup)
Input: a graph dataset D and a minimum support min_sup.
Output: the frequent graph set G.

1  G ← ∅
2  for every batch b_t of graphs in D
3    do C ← Coreset(b_t, min_sup)
4       G ← Coreset(G ∪ C, min_sup)
5  return G
21 / 48
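A minimal runnable sketch of the IncGraphMiner loop. Hashable pattern identifiers stand in for graphs here, and the Coreset() stand-in simply merges weighted multisets and keeps patterns at or above min_sup; a real implementation would mine δ-tolerance closed graphs (e.g., with a CloseGraph-style miner) at that step:

```python
from collections import Counter

def coreset(weighted_patterns, min_sup):
    """Stand-in for Coreset(): merge a weighted pattern multiset and keep
    patterns whose accumulated weight reaches min_sup. The real routine
    mines delta-tolerance closed graphs, weighted by relative support."""
    merged = Counter()
    for g, w in weighted_patterns:
        merged[g] += w
    return Counter({g: w for g, w in merged.items() if w >= min_sup})

def inc_graph_miner(batches, min_sup):
    """The incremental loop from the slide: summarize each batch with a
    coreset, then merge it into the running summary G."""
    G = Counter()
    for bt in batches:                  # bt: list of (pattern, weight)
        C = coreset(bt, min_sup)
        G = coreset(list(G.items()) + list(C.items()), min_sup)
    return G
```

Two batches with supports {A: 2, B: 1} and {A: 2, B: 3} at min_sup = 2 accumulate to {A: 4, B: 3}.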
WinGraphMiner

WinGraphMiner(D, W, min_sup)
Input: a graph dataset D, a window size W, and a minimum support min_sup.
Output: the frequent graph set G.

1  G ← ∅
2  for every batch b_t of graphs in D
3    do C ← Coreset(b_t, min_sup)
4       store C in the sliding window
5       if the sliding window is full
6         then R ← oldest C stored in the sliding window, with all support values negated
7         else R ← ∅
8       G ← Coreset(G ∪ C ∪ R, min_sup)
9  return G
22 / 48
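The sliding-window variant can be sketched the same way. A simple merge-and-filter stand-in plays the role of Coreset() so the snippet is self-contained, and "negate all support values" becomes negating the weights of the evicted batch:

```python
from collections import Counter, deque

def coreset(weighted_patterns, min_sup):
    # Merge-and-filter stand-in for Coreset(); see the slide's definition.
    merged = Counter()
    for g, w in weighted_patterns:
        merged[g] += w
    return Counter({g: w for g, w in merged.items() if w >= min_sup})

def win_graph_miner(batches, window, min_sup):
    """WinGraphMiner loop: when the window is full, the oldest batch
    coreset is merged back in with negated supports to forget it."""
    G = Counter()
    sw = deque()
    for bt in batches:                  # bt: list of (pattern, weight)
        C = coreset(bt, min_sup)
        sw.append(C)
        R = []
        if len(sw) > window:            # sliding window is full
            oldest = sw.popleft()
            R = [(g, -w) for g, w in oldest.items()]
        G = coreset(list(G.items()) + list(C.items()) + R, min_sup)
    return G
```

With window = 2 and three singleton batches A, B, C, the pattern A is forgotten once its batch slides out, leaving only B and C.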
Algorithm ADaptive Sliding WINdow

Example: W = 101010110111111. ADWIN examines every split W = W0 · W1; when some split (e.g., W0 = 101010110, W1 = 111111) satisfies |µ_W0 − µ_W1| ≥ ε_c, change is detected and elements are dropped from the tail of W.

ADWIN: Adaptive Windowing Algorithm
1  Initialize window W
2  for each t > 0
3    do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
4       repeat drop elements from the tail of W
5       until |µ_W0 − µ_W1| < ε_c holds
6             for every split of W into W = W0 · W1
7       output µ_W
23 / 48
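The pseudocode can be made concrete with a deliberately naive Python version that checks every split explicitly. The ε_c formula follows the usual Hoeffding-style bound with δ' = δ/n; the quadratic scan is an illustration only, since the real ADWIN compresses the window into exponential histograms to get O(log W) time and memory:

```python
import math

def adwin_shrink(window, delta=0.01):
    """Naive ADWIN step over a list ordered oldest-to-newest: repeatedly
    drop the oldest element while some split W = W0 . W1 has sub-window
    means differing by at least eps_cut."""
    def change_detected(w):
        n = len(w)
        for i in range(1, n):
            w0, w1 = w[:i], w[i:]
            m0 = sum(w0) / len(w0)
            m1 = sum(w1) / len(w1)
            # m as in the ADWIN paper: 1 / (1/|W0| + 1/|W1|)
            m = 1 / (1 / len(w0) + 1 / len(w1))
            eps = math.sqrt((1 / (2 * m)) * math.log(4 * n / delta))
            if abs(m0 - m1) >= eps:
                return True
        return False

    while change_detected(window):
        window.pop(0)          # drop from the tail (oldest element)
    return window
```

On a stable stream the window keeps growing; after an abrupt change from 0s to 1s, the stale prefix is shrunk away while the recent data is kept.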
Algorithm ADaptive Sliding WINdow
Theorem. At every time step:
1 (False positive rate bound) If µ_t remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2 (False negative rate bound) Suppose that for some partition of W into two parts W0 W1 (where W1 contains the most recent items) we have |µ_W0 − µ_W1| > 2ε_c. Then with probability 1 − δ, ADWIN shrinks W to W1 or shorter.

ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.
24 / 48
Algorithm ADaptive Sliding WINdow
ADWIN, using a data-stream sliding window model:
- can provide the exact counts of 1's in O(1) time per point
- tries O(log W) cutpoints
- uses O((1/ε) log W) memory words
- processing time per example is O(log W) (amortized and worst-case)

Sliding Window Model
Buckets:  1010101  101  11  1  1
Content:        4    2   2  1  1
Capacity:       7    3   2  1  1
25 / 48
ADAGRAPHMINER
AdaGraphMiner(D, Mode, min_sup)

1  G ← ∅
2  initialize ADWIN
3  for every batch b_t of graphs in D
4    do C ← Coreset(b_t, min_sup)
5       R ← ∅
6       if Mode is Sliding Window
7         then store C in the sliding window
8              if ADWIN detected change
9                then R ← batches to remove from the sliding window, with negated supports
10      G ← Coreset(G ∪ C ∪ R, min_sup)
11      if Mode is Sliding Window
12        then insert the number of closed graphs into ADWIN
13        else for every g in G, update g's ADWIN
14 return G
26 / 48
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
27 / 48
Experimental Evaluation
ChemDB dataset
- Public dataset
- 4 million molecules
- Institute for Genomics and Bioinformatics, University of California, Irvine

Open NCI Database
- Public domain
- 250,000 structures
- National Cancer Institute
28 / 48
What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
- Closely related to WEKA
- Includes a collection of offline and online algorithms, as well as tools for evaluation: classification, clustering
- Easy to extend
- Easy to design and run experiments
29 / 48
WEKA: the bird
30 / 48
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.
31 / 48
Classification Experimental Setting
32 / 48
Classification Experimental Setting
Classifiers
- Naive Bayes
- Decision Stumps
- Hoeffding Tree
- Hoeffding Option Tree
- Bagging and Boosting
- ADWIN Bagging and Leveraging Bagging

Prediction strategies
- Majority Class
- Naive Bayes Leaves
- Adaptive Hybrid
33 / 48
Clustering Experimental Setting
34 / 48
Clustering Experimental Setting
Clusterers: StreamKM++; CluStream; ClusTree; Den-Stream; CobWeb
35 / 48
Clustering Experimental Setting

Internal measures              External measures
Gamma                          Rand statistic
C Index                        Jaccard coefficient
Point-Biserial                 Folkes and Mallow Index
Log Likelihood                 Hubert Γ statistics
Dunn's Index                   Minkowski score
Tau                            Purity
Tau A                          van Dongen criterion
Tau C                          V-measure
Somer's Gamma                  Completeness
Ratio of Repetition            Homogeneity
Modified Ratio of Repetition   Variation of information
Adjusted Ratio of Clustering   Mutual information
Fagan's Index                  Class-based entropy
Deviation Index                Cluster-based entropy
Z-Score Index                  Precision
D Index                        Recall
Silhouette coefficient         F-measure

Table: Internal and external clustering evaluation measures.

36 / 48
Cluster Mapping Measure
Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes and Bernhard Pfahringer. An Effective Evaluation Measure for Clustering on Evolving Data Streams. KDD'11.

CMM: Cluster Mapping Measure. A novel evaluation measure for stream clustering on evolving streams:

CMM(C, CL) = 1 − ( Σ_{o∈F} w(o) · pen(o, C) ) / ( Σ_{o∈F} w(o) · con(o, Cl(o)) )
37 / 48
Extensions of MOA
- Multi-label Classification
- Active Learning
- Regression
- Closed Frequent Graph Mining
- Twitter Sentiment Analysis

Challenges for bigger data streams: sampling and distributed systems (MapReduce, Hadoop, S4)
38 / 48
Open NCI dataset
[Figure: running time (seconds) vs. number of instances on the Open NCI dataset for IncGraphMiner, IncGraphMiner-C, MoSS and CloseGraph.]
39 / 48
Open NCI dataset
[Figure: memory usage (megabytes) vs. number of instances on the Open NCI dataset for IncGraphMiner, IncGraphMiner-C, MoSS and CloseGraph.]
40 / 48
ChemDB dataset
[Figure: memory usage (megabytes) vs. number of instances on the ChemDB dataset for IncGraphMiner, IncGraphMiner-C, MoSS and CloseGraph.]
41 / 48
ChemDB dataset
[Figure: running time (seconds) vs. number of instances on the ChemDB dataset for IncGraphMiner, IncGraphMiner-C, MoSS and CloseGraph.]
42 / 48
ADAGRAPHMINER
[Figure: number of closed graphs vs. number of instances for AdaGraphMiner, AdaGraphMiner-Window and IncGraphMiner.]
43 / 48
Outline
1 Streaming Data
2 Frequent Pattern Mining: Mining Evolving Graph Streams; AdaGraphMiner
3 Experimental Evaluation
4 Summary and Future Work
44 / 48
Summary: Mining Evolving Graph Data Streams

Given a data stream D of graphs, find the frequent closed graphs.

[Table: three example graph transactions (small molecules over C, S, N, O atoms).]

We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
45 / 48
Summary

{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
http://moa.cs.waikato.ac.nz
- Closely related to WEKA
- Includes a collection of offline and online algorithms, as well as tools for evaluation: classification, clustering, frequent pattern mining
- MOA deals with evolving data streams
- MOA is easy to use and extend
46 / 48
Future work
Structured massive data mining:
- sampling, sketching
- parallel techniques
47 / 48
Thanks!
48 / 48