Transcript of CSE300

Page 1: CSE300

The FP-Growth/Apriori Debate

Jeffrey R. Ellis

CSE 300 – 01

April 11, 2002

Page 2: CSE300

Presentation Overview

- FP-Growth Algorithm
  - Refresher & example
  - Motivation

- FP-Growth Complexity vs. Apriori Complexity
  - Saving calculation, or hiding work?

- Real-World Application
  - Datasets are not created equal
  - Results of a real-world implementation

Page 3: CSE300

FP-Growth Algorithm

- Association Rule Mining
  - Generate frequent itemsets
    - Apriori generates candidate sets
    - FP-Growth uses specialized data structures (no candidate sets)
  - Find association rules
    - Outside the scope of both FP-Growth & Apriori

- Therefore, FP-Growth is a competitor to Apriori

Page 4: CSE300

FP-Growth Example

[Figure: FP-tree for the example DB. The root {} has children f:4 and c:1. Under f:4: c:3 -> a:3, where a:3 has children m:2 -> p:2 and b:1 -> m:1; f:4 also has child b:1. Under c:1: b:1 -> p:1. Node-links connect nodes holding the same item.]

Header Table

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

Conditional pattern bases

Item  Conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}
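The ordered-frequent-items column can be reproduced in a few lines of Python. A minimal sketch of the first scan, assuming min_support = 3 and hard-coding the slide's descending-frequency tie-break order f, c, a, b, m, p (variable names are mine, not from the slides):

```python
from collections import Counter

# The five transactions of the example DB (min_support = 3).
transactions = [
    list("facdgimp"), list("abcflmo"), list("bfhjo"),
    list("bcksp"), list("afcelpmn"),
]
min_support = 3

# Pass 1: count supports and keep only frequent items.
counts = Counter(item for t in transactions for item in t)
frequent = {i for i, n in counts.items() if n >= min_support}

# The slide orders items by descending support; ties (f/c at 4,
# a/b/m/p at 3) are broken here exactly as on the slide.
rank = {i: k for k, i in enumerate("fcabmp")}

for t in transactions:
    print([i for i in sorted(t, key=lambda i: rank.get(i, 99)) if i in frequent])
# ['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'], ['c','b','p'], ['f','c','a','m','p']
```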

Page 5: CSE300

FP-Growth Example

Item  Conditional pattern base   Conditional FP-tree
f     Empty                      Empty
c     {(f:3)}                    {(f:3)} | c
a     {(fc:3)}                   {(f:3, c:3)} | a
b     {(fca:1), (f:1), (c:1)}    Empty
m     {(fca:2), (fcab:1)}        {(f:3, c:3, a:3)} | m
p     {(fcam:2), (cb:1)}         {(c:3)} | p

Page 6: CSE300

FP-Growth Algorithm

FP-CreateTree
Input: DB, min_support
Output: FP-Tree

1. Scan DB & count all frequent items.
2. Create null root & set as current node.
3. For each Transaction T:
     Sort T's items.
     For each sorted Item I:
       Insert I into tree as a child of current node.
       Connect new tree node to header list.

- Two passes through DB
- Tree creation is based on the number of items in DB
- Complexity of CreateTree is O(|DB|)
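A compact Python sketch of FP-CreateTree following the scheme above; the FPNode class, the node-link wiring, and the alphabetical tie-break in the sort key are my assumptions, not the authors' exact code:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}        # item -> child FPNode
        self.node_link = None     # next tree node holding the same item

def create_tree(db, min_support):
    # Pass 1: scan DB and count all frequent items.
    counts = Counter(i for t in db for i in t)
    frequent = {i: n for i, n in counts.items() if n >= min_support}
    root = FPNode(None, None)     # null root
    header = {}                   # item -> head of its node-link list
    # Pass 2: insert each transaction's sorted frequent items as a path.
    for t in db:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, node)
                node.children[i] = child
                # Connect the new tree node to the header's node-link list.
                child.node_link = header.get(i)
                header[i] = child
            child.count += 1
            node = child
    return root, header
```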

Page 7: CSE300

FP-Growth Example

[Figure: the completed FP-tree, header table, and transaction table from page 4, shown again as the input to mining.]

Page 8: CSE300

FP-Growth Algorithm

FP-Growth
Input: FP-Tree, f_is (frequent itemset)
Output: All freq_patterns

if FP-Tree contains single Path P
then for each combination of nodes in P
       generate pattern f_is ∪ combination
else for each item i in header
       generate pattern f_is ∪ i
       construct pattern's conditional cFP-Tree
       if (cFP-Tree ≠ null)
       then call FP-Growth(cFP-Tree, pattern)

- Recursive algorithm creates FP-Tree structures and calls FP-Growth
- Claims no candidate generation
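A Python sketch of the same two-part recursion, reusing FPNode and create_tree from the page 6 sketch. Materializing the conditional pattern base as a duplicated transaction list is a naive simplification chosen for clarity, not how tuned implementations do it:

```python
from itertools import combinations

def fp_growth(tree, header, f_is, min_support, out):
    # Part 1: a single-path tree yields every combination of its nodes.
    path, node = [], tree
    while len(node.children) == 1:
        node = next(iter(node.children.values()))
        path.append(node)
    if not node.children:
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                support = min(n.count for n in combo)
                if support >= min_support:
                    out.append((tuple(n.item for n in combo) + f_is, support))
        return
    # Part 2: otherwise recurse on each header item's conditional FP-tree.
    for item, head in header.items():
        nodes = []
        while head is not None:          # follow the node-link list
            nodes.append(head)
            head = head.node_link
        support = sum(n.count for n in nodes)
        if support < min_support:
            continue
        out.append(((item,) + f_is, support))
        # Conditional pattern base: the prefix path above each occurrence.
        cond_db = []
        for n in nodes:
            prefix, p = [], n.parent
            while p.item is not None:
                prefix.append(p.item)
                p = p.parent
            cond_db.extend([prefix] * n.count)   # naive duplication per count
        ctree, cheader = create_tree(cond_db, min_support)
        if cheader:                              # cFP-Tree is not null
            fp_growth(ctree, cheader, (item,) + f_is, min_support, out)
```

On the page 4 example, calling fp_growth(root, header, (), 3, out) after create_tree would fill out with (pattern, support) pairs; this is a sanity-check sketch, not a performance claim.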

Page 9: CSE300

Two-Part Algorithm

if FP-Tree contains single Path P
then for each combination of nodes in P
       generate pattern f_is ∪ combination

e.g., a single path A -> B -> C -> D with p = |pattern| = 4 yields
AD, BD, CD, ABD, ACD, BCD, ABCD

Count: Σ (n = 1 to p-1) C(p-1, n) = 2^(p-1) - 1 = 7 patterns

[Figure: the single path A -> B -> C -> D]
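The single-path case is exactly an enumeration of combinations over the path nodes above the suffix item; a toy check of the count above:

```python
from itertools import combinations

path = ["A", "B", "C"]            # path nodes above the suffix item D
patterns = [combo + ("D",)
            for r in range(1, len(path) + 1)
            for combo in combinations(path, r)]
print(patterns)                   # AD, BD, CD, ABD, ACD, BCD, ABCD
assert len(patterns) == 2 ** len(path) - 1   # 2^(p-1) - 1 = 7
```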

Page 10: CSE300

Two-Part Algorithm

else for each item i in header
       generate pattern f_is ∪ i
       construct pattern's conditional cFP-Tree
       if (cFP-Tree ≠ null)
       then call FP-Growth(cFP-Tree, pattern)

e.g., items { A B C D E }, f_is = { D }:
  i = A: pattern base = (ABCD), (ACD), (AD)
  i = B: pattern base = (ABCD)
  i = C: pattern base = (ABCD), (ACD)
  i = E: pattern base = null

Pattern bases are generated by following the f_is path from the header to each node in the tree holding it, up to the tree root, for each header item.

[Figure: a tree over A, B, C, D, E whose D and E nodes illustrate the pattern bases above.]
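In terms of the page 6 structures, that walk is a chase down an item's node-link chain plus a climb from each occurrence toward the root. A sketch (the function name is mine):

```python
def pattern_base(header, item):
    """Collect (prefix_path, count) for each tree occurrence of `item`."""
    base, node = [], header.get(item)
    while node is not None:                    # follow the node-link list
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)                # climb toward the tree root
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
        node = node.node_link
    return base
```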

Page 11: CSE300

FP-Growth Complexity

Therefore, each path in the tree is at least partially traversed a number of times equal to the number of items in that path (the depth of the tree path) times the number of items in the header.

Searching through all paths is then bounded by O(header_count² * tree depth).

Creation of a new cFP-Tree occurs at each recursive step as well.

Page 12: CSE300

Sample Data FP-Tree

[Figure: an FP-tree built over sample data with items A, B, C, D, E, J, K, L, M; the many branching paths are not recoverable from the transcript.]

Page 13: CSE300

Algorithm Results (in English)

- Candidate-generation sets are exchanged for FP-Trees.

- You MUST take into account all paths that contain an itemset with a test item.

- You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.

- Trivial examples hide these assertions, leading to a belief that FP-Growth operates more efficiently.

Page 14: CSE300

FP-Growth Mining Example

[Figure: FP-tree. Root Null has children A:50, B:45, F:40, G:35, H:30, I:20. Branch paths: A:50 -> B:1 -> G:1 -> M:1; B:45 -> G:1 -> I:1 -> M:1; F:40 -> G:1 -> I:1 -> M:1; G:35 -> H:1 -> I:1 -> M:1.]

Header: A B F G H I M

Transactions: A (49), B (44), F (39), G (34), H (30), I (20), plus ABGM, BGIM, FGIM, GHIM

Page 15: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = A: pattern = { A M }, support = 1, pattern base -> cFP-Tree = null

Page 16: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = B: pattern = { B M }, support = 2, pattern base -> cFP-Tree …

Page 17: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree, a single path G:2 -> B:2 -> M:2. Header: G B M]

freq_itemset = { B M }

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM

Page 18: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = F: pattern = { F M }, support = 1, pattern base -> cFP-Tree = null

Page 19: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = G: pattern = { G M }, support = 4, pattern base -> cFP-Tree …

Page 20: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree for f_is = { G M }. Root Null -> I:3, with children G:2 -> M:2 and B:1 -> G:1 -> M:1; also Null -> B:1 -> G:1 -> M:1. Header: I B G M]

freq_itemset = { G M }

i = I: pattern = { I G M }, support = 3, pattern base -> cFP-Tree …

Page 21: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree, a single path I:3 -> G:3 -> M:3. Header: I G M]

freq_itemset = { I G M }

Patterns mined: IM, GM, IGM

All patterns: BM, GM, BGM, IM, GM, IGM

Page 22: CSE300

FP-Growth Mining Example

[Figure: the conditional FP-tree for f_is = { G M } from page 20. Header: I B G M]

freq_itemset = { G M }

i = B: pattern = { B G M }, support = 2, pattern base -> cFP-Tree …

Page 23: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree, a single path B:2 -> G:2 -> M:2. Header: B G M]

freq_itemset = { B G M }

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

Page 24: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = H: pattern = { H M }, support = 1, pattern base -> cFP-Tree = null

Page 25: CSE300

FP-Growth Mining Example

- Complete the pass for { M }, then move on to { I }, { H }, etc.

- Final frequent sets:
  L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
  L3: { BGM, GIM, BGM, GIM }
  L4: none

Page 26: CSE300

FP-Growth Redundancy vs. Apriori Candidate Sets

FP-Growth (support = 2) generates:
  L2: 10 sets (5 distinct)
  L3: 4 sets (2 distinct)
  Total: 14 sets

Apriori (support = 2) generates:
  C2: 21 sets
  C3: 2 sets
  Total: 23 sets

What about support = 1?
  Apriori: 23 sets
  FP-Growth: 28 sets in the { M } pass with i = A, B, F alone!
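The double-counting is easy to tally by treating the emitted patterns as a multiset; a toy check using the "All patterns" list accumulated through page 23:

```python
from collections import Counter

# Patterns emitted during the { M } pass, as listed on the slides.
mined = ["BM", "GM", "BGM", "IM", "GM", "IGM", "BM", "GM", "BGM"]
counts = Counter(mined)
print(len(mined), "emitted,", len(counts), "distinct")   # 9 emitted, 5 distinct
print({p: n for p, n in counts.items() if n > 1})        # {'BM': 2, 'GM': 3, 'BGM': 2}
```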

Page 27: CSE300

FP-Growth vs. Apriori

- Apriori visits each transaction when generating new candidate sets; FP-Growth does not
  - Data structures can be used to reduce the transaction list

- FP-Growth traces the set of concurrent items; Apriori generates candidate sets

- FP-Growth uses more complicated data structures & mining techniques

Page 28: CSE300

Algorithm Analysis Results

- FP-Growth IS NOT inherently faster than Apriori
  - Intuitively, it appears to condense data
  - Its mining scheme requires new work to replace candidate-set generation
  - Recursion obscures the additional effort

- FP-Growth may run faster than Apriori in some circumstances

- Complexity alone gives no guarantee of which algorithm is more efficient

Page 29: CSE300

Improvements to FP-Growth

- None currently reported, apart from MLFPT
  - Multiple Local Frequent Pattern Tree
  - A new algorithm based on FP-Growth
  - Distributes FP-Trees among processors

- No reports of complexity analysis, or of accuracy relative to FP-Growth

Page 30: CSE300

Real World Applications

- Zheng, Kohavi, Mason – "Real World Performance of Association Rule Algorithms"

- Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus

- Tested the implementations against 1 artificial and 3 real data sets

- Generated time-based comparisons

Page 31: CSE300

Apriori & FP-Growth

Apriori
  - Implementation from creator Christian Borgelt (GNU Public License)
  - C implementation
  - Entire dataset loaded into memory

FP-Growth
  - Implementation from creators Han & Pei
  - Version of February 5, 2001

Page 32: CSE300

Other Algorithms

CHARM
  - Based on the concept of the Closed Itemset
  - e.g., closed set { A, B, C } stands in for its equal-support subsets AB, AC, BC, etc. (see the sketch below)

CLOSET
  - Han & Pei's implementation of the Closed Itemset concept

MagnumOpus
  - Generates rules directly through a search-and-prune technique
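For reference, a closed itemset is one with no proper superset of equal support. A naive filter over a toy support table illustrates the idea (this is not CHARM's actual search strategy):

```python
def closed_itemsets(support):
    """Keep itemsets with no proper superset of equal support.
    support: dict mapping frozenset itemset -> count."""
    return {s: n for s, n in support.items()
            if not any(s < t and n == m for t, m in support.items())}

support = {frozenset("ABC"): 2, frozenset("AB"): 2, frozenset("AC"): 2,
           frozenset("BC"): 2, frozenset("A"): 3}
print(closed_itemsets(support))
# {A,B,C}:2 and {A}:3 survive; {A,B,C} "covers" its equal-support subsets
```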

Page 33: CSE300

Datasets

IBM-Artificial
  - Generated at IBM Almaden (T10I4D100K)
  - Often used in association rule mining studies

BMS-POS
  - Years of point-of-sale data from a retailer

BMS-WebView-1 & BMS-WebView-2
  - Months of clickstream traffic from e-commerce web sites

Page 34: CSE300

Dataset Characteristics

[Table of dataset characteristics shown on the original slide; not recoverable from the transcript.]

Page 35: CSE300

Experimental Considerations

Hardware specifications
  - Dual 550MHz Pentium III Xeon processors
  - 1GB memory

Support levels: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }

Confidence = 0%

No other applications running (the second processor handles system processes)

Page 36: CSE300

[Charts: runtime vs. support for IBM-Artificial and BMS-POS.]

Page 37: CSE300

[Charts: runtime vs. support for BMS-WebView-1 and BMS-WebView-2.]

Page 38: CSE300

Study Results – Real Data

- At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth

- At support < 0.20%, Apriori completes whenever FP-Growth completes
  - Exception: BMS-WebView-2 @ 0.01%

- Even when 2 million rules are generated, Apriori finishes in 10 minutes or less

- Proposed: the bottleneck is NOT the rule algorithm, but rule analysis

Page 39: CSE300

Real Data Results

Support  Dataset         Apriori   FP-Growth   Rules
0.01%    BMS-POS         186m      120m        214,300,568
         BMS-WebView-1   Failed    Failed      Failed
         BMS-WebView-2   Failed    13m 12s     Failed
0.04%    BMS-POS         16m 9s    10m 41s     5,061,105
         BMS-WebView-1   Failed    Failed      Failed
         BMS-WebView-2   58s       29s         1,096,720
0.06%    BMS-POS         8m 35s    6m 7s       1,837,824
         BMS-WebView-1   1m 50s    52s         3,011,836
         BMS-WebView-2   28s       16s         510,233
0.10%    BMS-POS         3m 58s    3m 12s      530,353
         BMS-WebView-1   1.2s      1.2s        10,360
         BMS-WebView-2   9.1s      5.9s        119,335
0.20%    BMS-POS         1m 14s    1m 35s      103,449
         BMS-WebView-1   0.4s      0.7s        1,516
         BMS-WebView-2   2.4s      2.3s        12,665

Page 40: CSE300

Study Results – Artificial Data

- At support < 0.40%, FP-Growth performs MUCH faster than Apriori

- At support 0.40%, FP-Growth and Apriori are comparable

Support  Apriori   FP-Growth   Rules
0.01%    4m 4s     20s         1,376,684
0.04%    1m 1s     9.2s        56,962
0.06%    44s       8.2s        41,215
0.10%    34s       7.1s        26,962
0.20%    20s       5.8s        13,151
0.40%    5.7s      4.3s        1,997

Page 41: CSE300

Real-World Study Conclusions

- FP-Growth (and other non-Apriori algorithms) performs better on artificial data

- On all data sets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets

- FP-Growth may be suitable when low support, a large result count, and fast generation are needed

- Future research may best be directed toward analyzing association rules

Page 42: CSE300

Research Conclusions

- FP-Growth does not have better complexity than Apriori
  - Common sense suggests it will run faster

- FP-Growth does not always have a better running time than Apriori
  - Support level and dataset appear more influential

- FP-Trees are very complex structures (Apriori's are simple)

- Location of data (memory vs. disk) is a non-factor in comparing the algorithms

Page 43: CSE300

To Use or Not To Use?

Question: Should FP-Growth be used in favor of Apriori?
  - Difficulty to code
  - High performance in extreme cases
  - Personal preference

More relevant questions:
  - What kind of data is it?
  - What kind of results do I want?
  - How will I analyze the resulting rules?

Page 44: CSE300

References

Han, Jiawei, Pei, Jian, and Yin, Yiwen, "Mining Frequent Patterns without Candidate Generation". In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

Orlando, Salvatore, "High Performance Mining of Short and Long Patterns". 2001.

Pei, Jian, Han, Jiawei, and Mao, Runying, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.

Webb, Geoffrey I., "Efficient search for association rules". In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.

Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, "Fast Parallel Association Rule Mining Without Candidacy Generation". In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM '01), San Jose, CA, USA, November 29 - December 2, 2001.

Zaki, Mohammed J., "Generating Non-Redundant Association Rules". In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.

Zheng, Zijian, Kohavi, Ron, and Mason, Llew, "Real World Performance of Association Rule Algorithms". In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.