CSE300
-
Upload
tarak-a-positive -
Category
Documents
-
view
215 -
download
0
description
Transcript of CSE300
The FP-Growth/Apriori Debate
Jeffrey R. Ellis
CSE 300 – 01
April 11, 2002
Presentation Overview
FP-Growth Algorithm Refresher & Example Motivation
FP-Growth Complexity Vs. Apriori Complexity Saving calculation or hiding work?
Real World Application Datasets are not created equal Results of real-world implementation
FP-Growth Algorithm
Association Rule Mining Generate Frequent Itemsets
Apriori generates candidate sets FP-Growth uses specialized data structures (no
candidate sets)
Find Association Rules Outside the scope of both FP-Growth & Apriori
Therefore, FP-Growth is a competitor to Apriori
FP-Growth Example
{}
f:4 c:1
b:1
p:1
b:1c:3
a:3
b:1m:2
p:2 m:1
Header Table
Item frequency head f 4c 4a 3b 3m 3p 3
Conditional pattern bases
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
TID Items bought (ordered) frequent items100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}200 {a, b, c, f, l, m, o} {f, c, a, b, m}300 {b, f, h, j, o} {f, b}400 {b, c, k, s, p} {c, b, p}500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
FP-Growth Example
EmptyEmptyf
{(f:3)}|c{(f:3)}c
{(f:3, c:3)}|a{(fc:3)}a
Empty{(fca:1), (f:1), (c:1)}b
{(f:3, c:3, a:3)}|m{(fca:2), (fcab:1)}m
{(c:3)}|p{(fcam:2), (cb:1)}p
Conditional FP-treeConditional pattern-baseItem
FP-Growth Algorithm
FP-CreateTreeInput: DB, min_supportOutput: FP-Tree1. Scan DB & count all frequent
items.2. Create null root & set as current
node.3. For each Transaction T
Sort T’s items. For each sorted Item I
Insert I into tree as a child of current node.
Connect new tree node to header list.
Two passes through DB
Tree creation is based on number of items in DB.
Complexity of CreateTree is O(|DB|)
FP-Growth Example
{}
f:4 c:1
b:1
p:1
b:1c:3
a:3
b:1m:2
p:2 m:1
Header Table
Item frequency head f 4c 4a 3b 3m 3p 3
TID Items bought (ordered) frequent items100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}200 {a, b, c, f, l, m, o} {f, c, a, b, m}300 {b, f, h, j, o} {f, b}400 {b, c, k, s, p} {c, b, p}500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
FP-Growth Algorithm
FP-GrowthInput: FP-Tree, f_is (frequent
itemset)Output: All freq_patterns
if FP-Tree contains single Path Pthen for each combination of nodes in P
generate pattern f_is combinationelse for each item i in header
generate pattern f_is iconstruct pattern’s conditional cFP-Treeif (FP-Tree 0)then call FP-Growth (cFP-Tree, pattern)
Recursive algorithm creates FP-Tree structures and calls FP-Growth
Claims no candidate generation
Two-Part Algorithm
if FP-Tree contains single Path P
then for each combination of nodes in P
generate pattern f_is combination
e.g., { A B C D } p = |pattern| = 4 AD, BD, CD, ABD, ACD, BCD, ABCD
(n=1 to p-1) (p-1Cn)
A
B
C
D
Two-Part Algorithmelse for each item i in header
generate pattern f_is iconstruct pattern’s conditional cFP-Treeif (cFP-Tree null)then call FP-Growth (cFP-Tree, pattern)
e.g., { A B C D E } for f_is = D i = A, p_base = (ABCD), (ACD), (AD) i = B, p_base = (ABCD) i = C, p_base = (ABCD), (ACD) i = E, p_base = null
Pattern bases are generated by following f_is path from header to each node in tree having it, up to tree root, for each header item.
A
B
C
D
D
C D
E
E
E
FP-Growth Complexity
Therefore, each path in the tree will be at least partially traversed the number of items existing in that tree path (the depth of the tree path) * the number of items in the header.
Complexity of searching through all paths is then bounded by O(header_count2 * depth of tree)
Creation of a new cFP-Tree occurs also.
Sample Data FP-Treenull
K
M
EDC
DC
L
B
A
…
J
L M
M
ML
ML M
M
K
…
…
…
M
ML
ML M
M
K
…
Algorithm Results (in English)
Candidate Generation sets exchanged for FP-Trees. You MUST take into account all paths that
contain an item set with a test item. You CANNOT determine before a
conditional FP-Tree is created if new frequent item sets will occur.
Trivial examples hide these assertions, leading to a belief that FP-Tree operates more efficiently.
Header:ABFGHI
M
FP-Growth Mining ExampleNull
M:1
I:1
G:1
B:45
M:1
I:1
G:1
F:40
M:1
I:1
H:1
G:35
M:1
G:1
B:1
A:50
Transactions:A (49)B (44)F (39)G (34)H (30)I (20)
ABGMBGIMFGIMGHIM
H:30 I:20
FP-Growth Mining ExampleNull
M:1
I:1
G:1
B:45
M:1
I:1
G:1
F:40
M:1
I:1
H:1
G:35
M:1
G:1
B:1
A:50Header:
ABFGHI
M
i = Apattern = { A M }support = 1pattern basecFP-Tree = null
freq_itemset = { M }
FP-Growth Mining ExampleNull
M:1
I:1
G:1
B:45
M:1
I:1
G:1
F:40
M:1
I:1
H:1
G:35
M:1
G:1
B:1
A:50Header:
ABFGHI
M
i = Bpattern = { B M }support = 2pattern basecFP-Tree …
freq_itemset = { M }
FP-Growth Mining Example
M:2
B:2
G:2Header:GBM
Patterns mined:BMGM
BGM
freq_itemset = { B M }
All patterns: BM, GM, BGM
FP-Growth Mining ExampleNull
M:1
I:1
G:1
B:45
M:1
I:1
G:1
F:40
M:1
I:1
H:1
G:35
M:1
G:1
B:1
A:50Header:
ABFGHI
M
i = Fpattern = { F M }support = 1pattern basecFP-Tree = null
freq_itemset = { M }
FP-Growth Mining ExampleNull
M:1
I:1
G:1
B:45
M:1
I:1
G:1
F:40
M:1
I:1
H:1
G:35
M:1
G:1
B:1
A:50Header:
ABFGHI
M
i = Gpattern = { G M }support = 4pattern basecFP-Tree …
freq_itemset = { M }
FP-Growth Mining ExampleNull
G:1
G:2B:1
I:3
M:2 M:1
G:1
B:1Header:
IBGM
i = Ipattern = { I G M }support = 3pattern basecFP-Tree …
freq_itemset = { G M }
M:1
FP-Growth Mining Example
M:3
G:3
I:3Header:IGM
Patterns mined:IMGMIGM
freq_itemset = { I G M }
All patterns: BM, GM, BGM, IM, GM, IGM
FP-Growth Mining ExampleNull
G:1
G:2B:1
I:3
M:2 M:1
G:1
B:1Header:
IBGM
i = Bpattern = { B G M }support = 2pattern basecFP-Tree …
freq_itemset = { G M }
M:1
FP-Growth Mining Example
M:2
G:2
B:2Header:BGM
Patterns mined:BMGM
BGM
freq_itemset = { B G M }
All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM
FP-Growth Mining ExampleNull
M:1
I:1
G:1
B:45
M:1
I:1
G:1
F:40
M:1
I:1
H:1
G:35
M:1
G:1
B:1
A:50Header:
ABFGHI
M
i = Hpattern = { H M }support = 1pattern basecFP-Tree = null
freq_itemset = { M }
FP-Growth Mining Example
Complete pass for { M } Move onto { I }, { H }, etc. Final Frequent Sets:
L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
L3 : { BGM, GIM, BGM, GIM } L4 : None
FP-Growth Redundancy v. Apriori Candidate Sets FP-Growth (support=2) generates:
L2 : 10 sets (5 distinct) L3 : 4 sets (2 distinct) Total : 14 sets
Apriori (support=2) generates: C2 : 21 sets C3 : 2 sets Total : 23 sets
What about support=1? Apriori : 23 sets FP-Growth : 28 sets in {M} with i = A, B, F alone!
FP-Growth vs. Apriori
Apriori visits each transaction when generating a new candidate sets; FP-Growth does not Can use data structures to reduce
transaction list
FP-Growth traces the set of concurrent items; Apriori generates candidate sets
FP-Growth uses more complicated data structures & mining techniques
Algorithm Analysis Results
FP-Growth IS NOT inherently faster than Apriori Intuitively, it appears to condense data Mining scheme requires some new work to
replace candidate set generation Recursion obscures the additional effort
FP-Growth may run faster than Apriori in circumstances
No guarantee through complexity which algorithm to use for efficiency
Improvements to FP-Growth
None currently reported MLFPT
Multiple Local Frequent Pattern Tree New algorithm that is based on FP-Growth Distributes FP-Trees among processors
No reports of complexity analysis or accuracy of FP-Growth
Real World Applications
Zheng, Kohavi, Mason – “Real World Performance of Association Rule Algorithms” Collected implementations of Apriori, FP-
Growth, CLOSET, CHARM, MagnumOpus Tested implementations against 1 artificial
and 3 real data sets Time-based comparisons generated
Apriori & FP-Growth
Apriori Implementation from creator Christian
Borgelt (GNU Public License) C implementation Entire dataset loaded into memory
FP-Growth Implementation from creators Han & Pei Version – February 5, 2001
Other Algorithms
CHARM Based on concept of Closed Itemset e.g., { A, B, C } – ABC, AC, BC,
ACB, AB, CB, etc.
CLOSET Han, Pei implementation of Closed Itemset
MagnumOpus Generates rules directly through search-
and-prune technique
Datasets
IBM-Artificial Generated at IBM Almaden (T10I4D100K) Often used in association rule mining
studies
BMS-POS Years of point-of-sale data from retailer
BMS-WebView-1 & BMS-WebView-2 Months of clickstream traffic from e-
commerce web sites
Dataset Characteristics
Experimental Considerations
Hardware Specifications Dual 550MHz Pentium III Xeon processors 1GB Memory
Support { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
Confidence = 0% No other applications running (second
processor handles system processes)
IBM-Artificial BMS-POS
BMS-WebView-1 BMS-WebView-2
Study Results – Real Data
At support 0.20%, Apriori performs as fast as or better than FP-Growth
At support < 0.20%, Apriori completes whenever FP-Growth completes Exception – BMS-WebView-2 @ 0.01%
When 2 million rules are generated, Apriori finishes in 10 minutes or less Proposed – Bottleneck is NOT the rule
algorithm, but rule analysis
Real Data Results
Algorithm Support Time Rules Time Rules Time Rules
Apriori 186m Failed FailedFP-Growth 120m Failed 13m 12s
Apriori 16m 9 s Failed 58sFP-Growth 10m 41s Failed 29s
Apriori 8m 35s 1m 50s 28sFP-Growth 6m 7s 52s 16s
Apriori 3m 58s 1.2s 9.1sFP-Growth 3m 12s 1.2s 5.9s
Apriori 1m 14s 0.4s 2.4sFP-Growth 1m 35s 0.7s 2.3s
BMS-POS BMS-WebView-1 BMS-WebView-2
214,300,568 Falied Failed0.01
0.04 5,061,105
0.06 1,837,824
0.10
0.20
530,353
103,449
Failed
3,011,836
10,360
1,516
1,096,720
510,233
119,335
12,665
Study Results – Artificial Data
At support < 0.40%, FP-Growth performs MUCH faster than Apriori
At support 0.40%, FP-Growth and Apriori are comparable
Support Time Rules Support Time Rules Support Time RulesApriori 4m 4s 1m 1s 44sFP-Growth 20s 9.2s 8.2s
Apriori 34s 20s 5.7sFP-Growth 7.1 5.8s 4.3s
1,376,684 0.04 56,962 0.06 41,215
0.10 26,962 0.20 13,151 0.40 1.997
0.01
Real-World Study Conclusions
FP-Growth (and other non-Apriori) perform better on artificial data
On all data sets, Apriori performs sufficiently well in reasonable time periods for reasonable result sets
FP-Growth may be suitable when low support, large result count, fast generation are needed
Future research may best be directed toward analyzing association rules
Research Conclusions
FP-Growth does not have a better complexity than Apriori Common sense indicates it will run faster
FP-Growth does not always have a better running time than Apriori Support, dataset appears more influential
FP-Trees are very complex structures (Apriori is simple)
Location of data (memory vs. disk) is non-factor in comparison of algorithms
To Use or Not To Use?
Question: Should the FP-Growth be used in favor of Apriori? Difficulty to code High performance at extreme cases Personal preference
More relevant questions What kind of data is it? What kind of results do I want? How will I analyze the resulting rules?
References Han, Jiawei, Pei, Jian, and Yin, Yiwen, “Mining Frequent Patterns without
Candidate Generation”. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
Orlando, Salvatore, “High Performance Mining of Short and Long Patterns”. 2001.
Pei, Jian, Han, Jiawei, and Mao, Runying, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
Webb, Geoffrey L., “Efficient search for association rules”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99--107, 2000.
Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, “Fast Parallel Association Rule Mining Without Candidacy Generation”. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001
Zaki, Mohammed J., “Generating Non-Redundant Association Rules”. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
Zheng, Zijian, Kohavi, Ron, and Mason, Llew, “Real World Performance of Association Rule Algorithms”. In proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.