Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Albert Bifet and Ricard Gavaldà
Universitat Politècnica de Catalunya
14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08)
2008 Las Vegas, USA
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Tree Mining
Mining frequent trees is becoming an important task
Applications:
chemical informatics
computer vision
text retrieval
bioinformatics
Web analysis
Many link-based structures may be studied formally by means of unordered trees
Introduction: Trees
Our trees are:
Rooted
Unlabeled
Ordered and Unordered
Our subtrees are:
Induced
Two different ordered trees but the same unordered tree
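The distinction between ordered and unordered trees can be made concrete with a small sketch. It assumes a representation that is not taken from the slides: an unlabeled rooted tree is a tuple of its children, and a leaf is the empty tuple. The helper name `canonical` is hypothetical. Sorting the children recursively gives one representative per unordered tree, so two ordered trees denote the same unordered tree exactly when their canonical forms coincide.

```python
def canonical(tree):
    """Return a canonical form of an unlabeled rooted tree.

    A tree is a tuple of child trees; a leaf is the empty tuple ().
    Children are canonicalized recursively and then sorted, so the
    result does not depend on the order in which siblings are listed.
    """
    return tuple(sorted(canonical(child) for child in tree))


# Root with two children: one has a single child, the other is a leaf.
t1 = (((),), ())
# Same shape with the two children swapped.
t2 = ((), ((),))

different_as_ordered = t1 != t2
same_as_unordered = canonical(t1) == canonical(t2)
```

Here `t1` and `t2` differ as ordered trees, while their canonical forms are equal, which is the situation the slide describes.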
Introduction
What Is Tree Pattern Mining?
Given a dataset of trees, find the complete set of frequent subtrees
Frequent Tree Pattern (FS):
Include all the trees whose support is no less than min_sup
Closed Frequent Tree Pattern (CS):
Include no tree which has a super-tree with the same support
CS ⊆ FS
Closed Frequent Tree Mining provides a compact representation of frequent trees without loss of information
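The two definitions above can be illustrated with a minimal sketch. To keep it short it uses itemsets as a stand-in for trees (an assumption made only for illustration: "subpattern" here is the subset relation, not the induced-subtree relation of the talk). The function names `supports`, `frequent` and `closed` are hypothetical.

```python
from itertools import combinations


def supports(transactions):
    """Count, for every non-empty sub-itemset, how many transactions contain it."""
    counts = {}
    for transaction in transactions:
        items = sorted(transaction)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def frequent(counts, min_sup):
    """FS: patterns whose support is no less than min_sup."""
    return {p: s for p, s in counts.items() if s >= min_sup}


def closed(freq):
    """CS: frequent patterns with no proper super-pattern of the same support.

    For frozensets, p < q tests "p is a proper subset of q".
    """
    return {
        p: s
        for p, s in freq.items()
        if not any(p < q and s == s_q for q, s_q in freq.items())
    }


D = [{"a", "b"}, {"a", "b"}, {"a"}]
FS = frequent(supports(D), min_sup=2)
CS = closed(FS)
```

In this toy dataset `{b}` is frequent but not closed, because `{a, b}` has the same support. Every frequent pattern's support can still be read off its smallest closed super-pattern, which is why the closed set loses no information.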
Introduction
Unordered Subtree Mining
[Figure: example trees A and B, with subtrees X and Y]
D = {A, B}, min_sup = 2
# Closed Subtrees: 2
# Frequent Subtrees: 9
Closed Subtrees: X, Y
Frequent Subtrees: [the nine trees are drawn on the slide]
Introduction
Problem
Given a data stream D of rooted, unlabeled and unordered trees, find frequent closed trees.
We provide three algorithms, of increasing power:
Incremental
Sliding Window
Adaptive
Relaxed Support
Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie.
CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data
Linear Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\epsilon_r \rceil$ intervals, where $\epsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i) \cdot \epsilon_r \ge 0$, $u_i = (n - i + 1) \cdot \epsilon_r \le 1$ and $i \le n$.
Linear Relaxed closed subpattern $t$: if and only if there exists no proper superpattern $t'$ of $t$ such that their supports belong to the same interval $I_i$.
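As a worked illustration of the linear relaxed intervals, the sketch below computes the bounds of an interval and the index of the interval that contains a given relative support. The function names are hypothetical, and exact fractions are used to avoid floating-point rounding at interval boundaries. One assumption goes beyond the definition: a support of exactly 1 is clamped into the topmost interval, since the half-open intervals do not cover it when 1/ε_r is an integer.

```python
import math
from fractions import Fraction


def num_intervals(eps_r):
    """n = ceil(1 / eps_r)."""
    return math.ceil(1 / Fraction(eps_r))


def linear_bounds(i, eps_r):
    """Return (l_i, u_i) with l_i = (n - i) * eps_r and u_i = (n - i + 1) * eps_r."""
    eps_r = Fraction(eps_r)
    n = num_intervals(eps_r)
    return (n - i) * eps_r, (n - i + 1) * eps_r


def linear_index(support, eps_r):
    """Return i such that l_i <= support < u_i, for a support in [0, 1]."""
    support, eps_r = Fraction(support), Fraction(eps_r)
    n = num_intervals(eps_r)
    i = n - math.floor(support / eps_r)
    return max(1, i)  # assumption: clamp support == 1 into the top interval


def same_relaxed_interval(support_t, support_super, eps_r):
    """True if a pattern and a proper superpattern fall in the same interval,
    in which case the pattern is not linear relaxed closed."""
    return linear_index(support_t, eps_r) == linear_index(support_super, eps_r)


eps = Fraction(1, 10)
i = linear_index(Fraction(35, 100), eps)
low, high = linear_bounds(i, eps)
```

With ε_r = 1/10 there are ten intervals of equal width, and a support of 0.35 lands in the interval [0.3, 0.4).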
Relaxed Support
As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support:
Logarithmic Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\epsilon_r \rceil$ intervals, where $\epsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = \lceil c^i \rceil$, $u_i = \lceil c^{i+1} - 1 \rceil$ and $i \le n$.
Logarithmic Relaxed closed subpattern $t$: if and only if there exists no proper superpattern $t'$ of $t$ such that their supports belong to the same interval $I_i$.
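To see how the logarithmic bounds grow, the sketch below simply evaluates the two formulas for successive values of i. The slide does not say how the base c is chosen, so c = 2 is an assumption picked for illustration, and the function name `log_bounds` is hypothetical.

```python
import math


def log_bounds(i, c):
    """Return (l_i, u_i) with l_i = ceil(c**i) and u_i = ceil(c**(i + 1) - 1)."""
    return math.ceil(c ** i), math.ceil(c ** (i + 1) - 1)


# Assumed base c = 2; bounds for i = 1, 2, 3.
bounds = [log_bounds(i, 2) for i in range(1, 4)]
# Width of each interval, to show the geometric growth.
widths = [u - l for l, u in bounds]
```

Unlike the linear scheme, where every interval has width ε_r, the widths here grow with i, so patterns with large supports are grouped more coarsely than patterns with small supports.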
Galois Lattice of closed set of trees
We need:
a Galois connection pair
a closure operator
[Figure: dataset D and two copies of the lattice with nodes 1, 2, 3; 12, 13, 23; 123]
Algorithms
Incremental: INCTREENAT
Sliding Window: WINTREENAT
Adaptive: ADATREENAT. Uses ADWIN to monitor change
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems):
On ratio of false positives and negatives
On the relation of the size of the current window and change rates
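The idea of a window whose size adapts to change can be sketched as follows. This is a deliberately naive illustration, not the authors' implementation: it stores every element and tests every split point, whereas the real ADWIN uses a compressed bucket structure for sublinear memory. The class name `NaiveAdwin` and the default `delta` are assumptions. The cut threshold is the Hoeffding-style bound from the ADWIN paper as I understand it, with m the harmonic mean of the two sub-window lengths and δ' = δ/n, and it assumes values in [0, 1].

```python
import math
from collections import deque


class NaiveAdwin:
    """Naive adaptive window over values in [0, 1] (illustrative only)."""

    def __init__(self, delta=0.01):
        self.delta = delta      # confidence parameter (assumed default)
        self.window = deque()   # oldest element on the left

    def add(self, x):
        """Append x; drop old elements while a change is detected.

        Returns True if the window was shrunk.
        """
        self.window.append(x)
        shrunk = False
        while self._drop_oldest_if_change():
            shrunk = True
        return shrunk

    def mean(self):
        return sum(self.window) / len(self.window)

    def _drop_oldest_if_change(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        delta_prime = self.delta / n
        head_sum = 0.0
        # Try every split of the window into an older part W0 and a newer part W1.
        for n0, x in enumerate(self.window, start=1):
            if n0 == n:
                break
            head_sum += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) >= eps_cut:
                self.window.popleft()   # means differ too much: forget oldest
                return True
        return False


detector = NaiveAdwin()
stable = [detector.add(0.0) for _ in range(200)]   # stationary phase
changed = [detector.add(1.0) for _ in range(200)]  # abrupt change
```

While the stream is stationary the window keeps growing; after the abrupt change the older elements are discarded, so the window mean tracks the new distribution.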
Experimental Validation: TN1
Figure: Time on experiments on ordered trees on TN1 dataset (x-axis: Size (Millions); y-axis: Time (sec.); series: INCTREENAT, CMTreeMiner)
Experimental Validation
Figure: Number of closed trees maintaining the same number of closed datasets on input data (x-axis: Number of Samples; y-axis: Number of Closed Trees; series: AdaTreeInc 1, AdaTreeInc 2)