Transcript of CSE300

Page 1: CSE300

The FP-Growth/Apriori Debate

Jeffrey R. Ellis

CSE 300 – 01

April 11, 2002

Page 2: CSE300

Presentation Overview

- FP-Growth Algorithm
  - Refresher & example
  - Motivation

- FP-Growth Complexity vs. Apriori Complexity
  - Saving calculation, or hiding work?

- Real-World Application
  - Datasets are not created equal
  - Results of a real-world implementation

Page 3: CSE300

FP-Growth Algorithm

- Association Rule Mining
  - Generate frequent itemsets
    - Apriori generates candidate sets
    - FP-Growth uses specialized data structures (no candidate sets)
  - Find association rules
    - Outside the scope of both FP-Growth & Apriori

- Therefore, FP-Growth is a competitor to Apriori

Page 4: CSE300

FP-Growth Example

[Figure: FP-tree for the example DB. The root {} has children f:4 and c:1. Under f:4: c:3 -> a:3, where a:3 has children m:2 -> p:2 and b:1 -> m:1; f:4 also has child b:1. Under c:1: b:1 -> p:1. Node-links connect nodes holding the same item.]

Header Table

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

Conditional pattern bases

Item  Conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}
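The ordered-frequent-items column can be reproduced in a few lines of Python. A minimal sketch of the first scan, assuming min_support = 3 and hard-coding the slide's descending-frequency tie-break order f, c, a, b, m, p (variable names are mine, not from the slides):

```python
from collections import Counter

# The five transactions of the example DB (min_support = 3).
transactions = [
    list("facdgimp"), list("abcflmo"), list("bfhjo"),
    list("bcksp"), list("afcelpmn"),
]
min_support = 3

# Pass 1: count supports and keep only frequent items.
counts = Counter(item for t in transactions for item in t)
frequent = {i for i, n in counts.items() if n >= min_support}

# The slide orders items by descending support; ties (f/c at 4,
# a/b/m/p at 3) are broken here exactly as on the slide.
rank = {i: k for k, i in enumerate("fcabmp")}

for t in transactions:
    print([i for i in sorted(t, key=lambda i: rank.get(i, 99)) if i in frequent])
# ['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'], ['c','b','p'], ['f','c','a','m','p']
```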

Page 5: CSE300

FP-Growth Example

Item  Conditional pattern base   Conditional FP-tree
f     Empty                      Empty
c     {(f:3)}                    {(f:3)} | c
a     {(fc:3)}                   {(f:3, c:3)} | a
b     {(fca:1), (f:1), (c:1)}    Empty
m     {(fca:2), (fcab:1)}        {(f:3, c:3, a:3)} | m
p     {(fcam:2), (cb:1)}         {(c:3)} | p

Page 6: CSE300

FP-Growth Algorithm

FP-CreateTree
Input: DB, min_support
Output: FP-Tree

1. Scan DB & count all frequent items.
2. Create null root & set as current node.
3. For each Transaction T:
     Sort T's items.
     For each sorted Item I:
       Insert I into tree as a child of current node.
       Connect new tree node to header list.

- Two passes through DB
- Tree creation is based on the number of items in DB
- Complexity of CreateTree is O(|DB|)
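A compact Python sketch of FP-CreateTree following the scheme above; the FPNode class, the node-link wiring, and the alphabetical tie-break in the sort key are my assumptions, not the authors' exact code:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}        # item -> child FPNode
        self.node_link = None     # next tree node holding the same item

def create_tree(db, min_support):
    # Pass 1: scan DB and count all frequent items.
    counts = Counter(i for t in db for i in t)
    frequent = {i: n for i, n in counts.items() if n >= min_support}
    root = FPNode(None, None)     # null root
    header = {}                   # item -> head of its node-link list
    # Pass 2: insert each transaction's sorted frequent items as a path.
    for t in db:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, node)
                node.children[i] = child
                # Connect the new tree node to the header's node-link list.
                child.node_link = header.get(i)
                header[i] = child
            child.count += 1
            node = child
    return root, header
```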

Page 7: CSE300

FP-Growth Example

[Figure: the completed FP-tree, header table, and transaction table from page 4, shown again as the input to mining.]

Page 8: CSE300

FP-Growth Algorithm

FP-Growth
Input: FP-Tree, f_is (frequent itemset)
Output: All freq_patterns

if FP-Tree contains single Path P
then for each combination of nodes in P
       generate pattern f_is ∪ combination
else for each item i in header
       generate pattern f_is ∪ i
       construct pattern's conditional cFP-Tree
       if (cFP-Tree ≠ null)
       then call FP-Growth(cFP-Tree, pattern)

- Recursive algorithm creates FP-Tree structures and calls FP-Growth
- Claims no candidate generation
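A Python sketch of the same two-part recursion, reusing FPNode and create_tree from the page 6 sketch. Materializing the conditional pattern base as a duplicated transaction list is a naive simplification chosen for clarity, not how tuned implementations do it:

```python
from itertools import combinations

def fp_growth(tree, header, f_is, min_support, out):
    # Part 1: a single-path tree yields every combination of its nodes.
    path, node = [], tree
    while len(node.children) == 1:
        node = next(iter(node.children.values()))
        path.append(node)
    if not node.children:
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                support = min(n.count for n in combo)
                if support >= min_support:
                    out.append((tuple(n.item for n in combo) + f_is, support))
        return
    # Part 2: otherwise recurse on each header item's conditional FP-tree.
    for item, head in header.items():
        nodes = []
        while head is not None:          # follow the node-link list
            nodes.append(head)
            head = head.node_link
        support = sum(n.count for n in nodes)
        if support < min_support:
            continue
        out.append(((item,) + f_is, support))
        # Conditional pattern base: the prefix path above each occurrence.
        cond_db = []
        for n in nodes:
            prefix, p = [], n.parent
            while p.item is not None:
                prefix.append(p.item)
                p = p.parent
            cond_db.extend([prefix] * n.count)   # naive duplication per count
        ctree, cheader = create_tree(cond_db, min_support)
        if cheader:                              # cFP-Tree is not null
            fp_growth(ctree, cheader, (item,) + f_is, min_support, out)
```

On the page 4 example, calling fp_growth(root, header, (), 3, out) after create_tree would fill out with (pattern, support) pairs; this is a sanity-check sketch, not a performance claim.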

Page 9: CSE300

Two-Part Algorithm

if FP-Tree contains single Path P
then for each combination of nodes in P
       generate pattern f_is ∪ combination

e.g., a single path A -> B -> C -> D with p = |pattern| = 4 yields
AD, BD, CD, ABD, ACD, BCD, ABCD

Count: Σ (n = 1 to p-1) C(p-1, n) = 2^(p-1) - 1 = 7 patterns

[Figure: the single path A -> B -> C -> D]
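The single-path case is exactly an enumeration of combinations over the path nodes above the suffix item; a toy check of the count above:

```python
from itertools import combinations

path = ["A", "B", "C"]            # path nodes above the suffix item D
patterns = [combo + ("D",)
            for r in range(1, len(path) + 1)
            for combo in combinations(path, r)]
print(patterns)                   # AD, BD, CD, ABD, ACD, BCD, ABCD
assert len(patterns) == 2 ** len(path) - 1   # 2^(p-1) - 1 = 7
```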

Page 10: CSE300

Two-Part Algorithm

else for each item i in header
       generate pattern f_is ∪ i
       construct pattern's conditional cFP-Tree
       if (cFP-Tree ≠ null)
       then call FP-Growth(cFP-Tree, pattern)

e.g., items { A B C D E }, f_is = { D }:
  i = A: pattern base = (ABCD), (ACD), (AD)
  i = B: pattern base = (ABCD)
  i = C: pattern base = (ABCD), (ACD)
  i = E: pattern base = null

Pattern bases are generated by following the f_is path from the header to each node in the tree holding it, up to the tree root, for each header item.

[Figure: a tree over A, B, C, D, E whose D and E nodes illustrate the pattern bases above.]
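In terms of the page 6 structures, that walk is a chase down an item's node-link chain plus a climb from each occurrence toward the root. A sketch (the function name is mine):

```python
def pattern_base(header, item):
    """Collect (prefix_path, count) for each tree occurrence of `item`."""
    base, node = [], header.get(item)
    while node is not None:                    # follow the node-link list
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)                # climb toward the tree root
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
        node = node.node_link
    return base
```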

Page 11: CSE300

FP-Growth Complexity

Therefore, each path in the tree is at least partially traversed a number of times equal to the number of items in that path (the depth of the tree path) times the number of items in the header.

Searching through all paths is then bounded by O(header_count² * tree depth).

Creation of a new cFP-Tree occurs at each recursive step as well.

Page 12: CSE300

Sample Data FP-Tree

[Figure: an FP-tree built over sample data with items A, B, C, D, E, J, K, L, M; the many branching paths are not recoverable from the transcript.]

Page 13: CSE300

Algorithm Results (in English)

- Candidate-generation sets are exchanged for FP-Trees.

- You MUST take into account all paths that contain an itemset with a test item.

- You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.

- Trivial examples hide these assertions, leading to a belief that FP-Growth operates more efficiently.

Page 14: CSE300

FP-Growth Mining Example

[Figure: FP-tree. Root Null has children A:50, B:45, F:40, G:35, H:30, I:20. Branch paths: A:50 -> B:1 -> G:1 -> M:1; B:45 -> G:1 -> I:1 -> M:1; F:40 -> G:1 -> I:1 -> M:1; G:35 -> H:1 -> I:1 -> M:1.]

Header: A B F G H I M

Transactions: A (49), B (44), F (39), G (34), H (30), I (20), plus ABGM, BGIM, FGIM, GHIM

Page 15: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = A: pattern = { A M }, support = 1, pattern base -> cFP-Tree = null

Page 16: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = B: pattern = { B M }, support = 2, pattern base -> cFP-Tree …

Page 17: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree, a single path G:2 -> B:2 -> M:2. Header: G B M]

freq_itemset = { B M }

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM

Page 18: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = F: pattern = { F M }, support = 1, pattern base -> cFP-Tree = null

Page 19: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = G: pattern = { G M }, support = 4, pattern base -> cFP-Tree …

Page 20: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree for f_is = { G M }. Root Null -> I:3, with children G:2 -> M:2 and B:1 -> G:1 -> M:1; also Null -> B:1 -> G:1 -> M:1. Header: I B G M]

freq_itemset = { G M }

i = I: pattern = { I G M }, support = 3, pattern base -> cFP-Tree …

Page 21: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree, a single path I:3 -> G:3 -> M:3. Header: I G M]

freq_itemset = { I G M }

Patterns mined: IM, GM, IGM

All patterns: BM, GM, BGM, IM, GM, IGM

Page 22: CSE300

FP-Growth Mining Example

[Figure: the conditional FP-tree for f_is = { G M } from page 20. Header: I B G M]

freq_itemset = { G M }

i = B: pattern = { B G M }, support = 2, pattern base -> cFP-Tree …

Page 23: CSE300

FP-Growth Mining Example

[Figure: conditional FP-tree, a single path B:2 -> G:2 -> M:2. Header: B G M]

freq_itemset = { B G M }

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

Page 24: CSE300

FP-Growth Mining Example

[Figure: the FP-tree from page 14, with header A B F G H I M.]

freq_itemset = { M }

i = H: pattern = { H M }, support = 1, pattern base -> cFP-Tree = null

Page 25: CSE300

FP-Growth Mining Example

- Complete the pass for { M }, then move on to { I }, { H }, etc.

- Final frequent sets:
  L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
  L3: { BGM, GIM, BGM, GIM }
  L4: none

Page 26: CSE300

FP-Growth Redundancy vs. Apriori Candidate Sets

FP-Growth (support = 2) generates:
  L2: 10 sets (5 distinct)
  L3: 4 sets (2 distinct)
  Total: 14 sets

Apriori (support = 2) generates:
  C2: 21 sets
  C3: 2 sets
  Total: 23 sets

What about support = 1?
  Apriori: 23 sets
  FP-Growth: 28 sets in the { M } pass with i = A, B, F alone!
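The double-counting is easy to tally by treating the emitted patterns as a multiset; a toy check using the "All patterns" list accumulated through page 23:

```python
from collections import Counter

# Patterns emitted during the { M } pass, as listed on the slides.
mined = ["BM", "GM", "BGM", "IM", "GM", "IGM", "BM", "GM", "BGM"]
counts = Counter(mined)
print(len(mined), "emitted,", len(counts), "distinct")   # 9 emitted, 5 distinct
print({p: n for p, n in counts.items() if n > 1})        # {'BM': 2, 'GM': 3, 'BGM': 2}
```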

Page 27: CSE300

FP-Growth vs. Apriori

- Apriori visits each transaction when generating new candidate sets; FP-Growth does not
  - Data structures can be used to reduce the transaction list

- FP-Growth traces the set of concurrent items; Apriori generates candidate sets

- FP-Growth uses more complicated data structures & mining techniques

Page 28: CSE300

Algorithm Analysis Results

- FP-Growth IS NOT inherently faster than Apriori
  - Intuitively, it appears to condense data
  - Its mining scheme requires new work to replace candidate-set generation
  - Recursion obscures the additional effort

- FP-Growth may run faster than Apriori in some circumstances

- Complexity alone gives no guarantee of which algorithm is more efficient

Page 29: CSE300

Improvements to FP-Growth

- None currently reported, apart from MLFPT
  - Multiple Local Frequent Pattern Tree
  - A new algorithm based on FP-Growth
  - Distributes FP-Trees among processors

- No reports of complexity analysis, or of accuracy relative to FP-Growth

Page 30: CSE300

Real World Applications

- Zheng, Kohavi, Mason – "Real World Performance of Association Rule Algorithms"

- Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus

- Tested the implementations against 1 artificial and 3 real data sets

- Generated time-based comparisons

Page 31: CSE300

Apriori & FP-Growth

Apriori
  - Implementation from creator Christian Borgelt (GNU Public License)
  - C implementation
  - Entire dataset loaded into memory

FP-Growth
  - Implementation from creators Han & Pei
  - Version of February 5, 2001

Page 32: CSE300

Other Algorithms

CHARM
  - Based on the concept of the Closed Itemset
  - e.g., closed set { A, B, C } stands in for its equal-support subsets AB, AC, BC, etc. (see the sketch below)

CLOSET
  - Han & Pei's implementation of the Closed Itemset concept

MagnumOpus
  - Generates rules directly through a search-and-prune technique
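For reference, a closed itemset is one with no proper superset of equal support. A naive filter over a toy support table illustrates the idea (this is not CHARM's actual search strategy):

```python
def closed_itemsets(support):
    """Keep itemsets with no proper superset of equal support.
    support: dict mapping frozenset itemset -> count."""
    return {s: n for s, n in support.items()
            if not any(s < t and n == m for t, m in support.items())}

support = {frozenset("ABC"): 2, frozenset("AB"): 2, frozenset("AC"): 2,
           frozenset("BC"): 2, frozenset("A"): 3}
print(closed_itemsets(support))
# {A,B,C}:2 and {A}:3 survive; {A,B,C} "covers" its equal-support subsets
```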

Page 33: CSE300

Datasets

IBM-Artificial
  - Generated at IBM Almaden (T10I4D100K)
  - Often used in association rule mining studies

BMS-POS
  - Years of point-of-sale data from a retailer

BMS-WebView-1 & BMS-WebView-2
  - Months of clickstream traffic from e-commerce web sites

Page 34: CSE300

Dataset Characteristics

[Table of dataset characteristics shown on the original slide; not recoverable from the transcript.]

Page 35: CSE300

Experimental Considerations

Hardware specifications
  - Dual 550MHz Pentium III Xeon processors
  - 1GB memory

Support levels: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }

Confidence = 0%

No other applications running (the second processor handles system processes)

Page 36: CSE300

[Charts: runtime vs. support for IBM-Artificial and BMS-POS.]

Page 37: CSE300

[Charts: runtime vs. support for BMS-WebView-1 and BMS-WebView-2.]

Page 38: CSE300

Study Results – Real Data

- At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth

- At support < 0.20%, Apriori completes whenever FP-Growth completes
  - Exception: BMS-WebView-2 @ 0.01%

- Even when 2 million rules are generated, Apriori finishes in 10 minutes or less

- Proposed: the bottleneck is NOT the rule algorithm, but rule analysis

Page 39: CSE300

Real Data Results

Support  Dataset         Apriori   FP-Growth   Rules
0.01%    BMS-POS         186m      120m        214,300,568
         BMS-WebView-1   Failed    Failed      Failed
         BMS-WebView-2   Failed    13m 12s     Failed
0.04%    BMS-POS         16m 9s    10m 41s     5,061,105
         BMS-WebView-1   Failed    Failed      Failed
         BMS-WebView-2   58s       29s         1,096,720
0.06%    BMS-POS         8m 35s    6m 7s       1,837,824
         BMS-WebView-1   1m 50s    52s         3,011,836
         BMS-WebView-2   28s       16s         510,233
0.10%    BMS-POS         3m 58s    3m 12s      530,353
         BMS-WebView-1   1.2s      1.2s        10,360
         BMS-WebView-2   9.1s      5.9s        119,335
0.20%    BMS-POS         1m 14s    1m 35s      103,449
         BMS-WebView-1   0.4s      0.7s        1,516
         BMS-WebView-2   2.4s      2.3s        12,665

Page 40: CSE300

Study Results – Artificial Data

- At support < 0.40%, FP-Growth performs MUCH faster than Apriori

- At support 0.40%, FP-Growth and Apriori are comparable

Support  Apriori   FP-Growth   Rules
0.01%    4m 4s     20s         1,376,684
0.04%    1m 1s     9.2s        56,962
0.06%    44s       8.2s        41,215
0.10%    34s       7.1s        26,962
0.20%    20s       5.8s        13,151
0.40%    5.7s      4.3s        1,997

Page 41: CSE300

Real-World Study Conclusions

- FP-Growth (and other non-Apriori algorithms) performs better on artificial data

- On all data sets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets

- FP-Growth may be suitable when low support, a large result count, and fast generation are needed

- Future research may best be directed toward analyzing association rules

Page 42: CSE300

Research Conclusions

- FP-Growth does not have better complexity than Apriori
  - Common sense suggests it will run faster

- FP-Growth does not always have a better running time than Apriori
  - Support level and dataset appear more influential

- FP-Trees are very complex structures (Apriori's are simple)

- Location of data (memory vs. disk) is a non-factor in comparing the algorithms

Page 43: CSE300

To Use or Not To Use?

Question: Should FP-Growth be used in favor of Apriori?
  - Difficulty to code
  - High performance in extreme cases
  - Personal preference

More relevant questions:
  - What kind of data is it?
  - What kind of results do I want?
  - How will I analyze the resulting rules?

Page 44: CSE300

References

Han, Jiawei, Pei, Jian, and Yin, Yiwen, "Mining Frequent Patterns without Candidate Generation". In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

Orlando, Salvatore, "High Performance Mining of Short and Long Patterns". 2001.

Pei, Jian, Han, Jiawei, and Mao, Runying, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.

Webb, Geoffrey I., "Efficient search for association rules". In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.

Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, "Fast Parallel Association Rule Mining Without Candidacy Generation". In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM '01), San Jose, CA, USA, November 29 - December 2, 2001.

Zaki, Mohammed J., "Generating Non-Redundant Association Rules". In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.

Zheng, Zijian, Kohavi, Ron, and Mason, Llew, "Real World Performance of Association Rule Algorithms". In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.