Top Down FP-Growth for Association Rule Mining


Page 1: Top Down FP-Growth for Association Rule Mining

Top Down FP-Growth for Association Rule Mining

By Ke Wang

Page 2: Top Down FP-Growth for Association Rule Mining

Introduction

• Classically, for a rule A → B:
  – support: computed as count(AB)
    • frequent: if it passes the minimum support threshold
  – confidence: computed as count(AB) / count(A)
    • confident: if it passes the minimum confidence threshold
• How to mine association rules?
  – find all frequent patterns
  – generate rules from the frequent patterns
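The two measures defined on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `count` and `rule_stats` are made up here, and the toy database is the five-transaction example used on the later FP-tree slides.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs as defined above.

    support    = count(AB)
    confidence = count(AB) / count(A)
    """
    support = count(transactions, set(lhs) | set(rhs))
    lhs_count = count(transactions, lhs)
    confidence = support / lhs_count if lhs_count else 0.0
    return support, confidence


# Toy database: the five transactions of the FP-tree example.
transactions = [frozenset(t) for t in (
    {"b", "e"}, {"a", "b", "c", "e"}, {"b", "c", "e"}, {"a", "c", "d"}, {"a"},
)]

support, confidence = rule_stats(transactions, {"c"}, {"e"})
print(support, round(confidence, 2))
```

With a minimum support of 2 and a minimum confidence of 60%, the rule c → e would be both frequent and confident on this toy data.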

Page 3: Top Down FP-Growth for Association Rule Mining

Introduction

• Limitations of current research
  – uses a uniform minimum support threshold
  – uses only support as the pruning measure
• Our contribution
  – improve efficiency
  – adopt multiple minimum supports
  – introduce confidence pruning

Page 4: Top Down FP-Growth for Association Rule Mining

Related work -- Frequent pattern mining

• Apriori algorithm
  – method: use the anti-monotone property of support to prune, i.e.
    • if a length-k pattern is infrequent, its length-(k+1) super-patterns can never be frequent
• FP-growth algorithm -- better than Apriori
  – method:
    • build an FP-tree to store the database
    • mine the FP-tree in bottom-up order
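The anti-monotone pruning step can be illustrated with a small candidate generator. This is a minimal sketch, not a full Apriori implementation; the function names are hypothetical, and the input is the set of frequent 2-itemsets from the five-transaction example used on the later slides.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_k):
    """True if some k-subset of a (k+1)-candidate is not frequent."""
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent_k for s in combinations(candidate, k))


def generate_candidates(frequent_k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune.

    A candidate is kept only if all of its k-subsets are frequent,
    because an infrequent pattern can never have a frequent super-pattern.
    """
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == len(a) + 1 and not has_infrequent_subset(union, frequent_k):
                candidates.add(union)
    return candidates


# Frequent 2-itemsets of the toy example with minsup = 2.
frequent_2 = {frozenset(p) for p in ({"b", "c"}, {"a", "c"}, {"b", "e"}, {"c", "e"})}
print(generate_candidates(frequent_2))
```

Here {a, b, c} and {a, c, e} are discarded without counting, since {a, b} and {a, e} are infrequent; only {b, c, e} remains to be counted.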

Page 5: Top Down FP-Growth for Association Rule Mining

Related work -- Association rule mining

• Fast algorithms that try to guarantee completeness of frequent patterns
• Parallel algorithms & association-rule-based query languages
• Various association rule mining problems
  – multi-level, multi-dimension rules
  – constraints on specific items

Page 6: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

• Similar tree structure to FP-growth
  – compressed tree to store the database
  – nodes on each path of the tree are globally ordered
• Different mining method vs. FP-growth
  – FP-growth: bottom-up tree mining
  – TD-FP-Growth: top-down tree mining

Page 7: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

Construct an FP-tree:

Transactions (minsup = 2):
  b, e
  a, b, c, e
  b, c, e
  a, c, d
  a

Header Table H

  Entry value   count   side-link
  a             3
  b             3
  c             3
  e             3

FP-tree and header table H

  root
    a: 3
      b: 1
        c: 1
          e: 1
      c: 1
    b: 2
      e: 1
      c: 1
        e: 1
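The construction of this tree can be sketched as follows. This is an illustrative sketch only: the class and function names are invented here, and the global item order is assumed to be alphabetical among the frequent items, which matches the order a, b, c, e on the slide (all four items tie at count 3).

```python
from collections import Counter


class Node:
    """One tree node: an item, its count, a parent link and child links."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, minsup):
    # First pass: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    # Assumed global order: alphabetical among frequent items.
    order = sorted(i for i, c in counts.items() if c >= minsup)
    root = Node(None, None)
    header = {i: [] for i in order}  # item -> nodes on its side-link
    # Second pass: insert each transaction along the global order.
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, counts


transactions = [{"b", "e"}, {"a", "b", "c", "e"}, {"b", "c", "e"}, {"a", "c", "d"}, {"a"}]
root, header, counts = build_fp_tree(transactions, minsup=2)
for item, nodes in header.items():
    print(item, counts[item], [n.count for n in nodes])
```

Item d occurs only once, so it is dropped before the tree is built; the remaining nine nodes carry the counts shown in the figure.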

Page 8: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

FP-growth: bottom-up mining

[Figure: FP-tree and header table H (item, head of node-link) for the transactions {b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a} with minsup = 2. Tree nodes: root; a: 3; b: 2; b: 1; c: 1 (three nodes); e: 1 (three nodes).]

Mining order: e, c, b, a

e's conditional pattern base:
  (b: 1)
  (b: 1, c: 1)
  (a: 1, b: 1, c: 1)

Page 9: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

• FP-growth: bottom-up mining

e's conditional pattern base:
  (b: 1)
  (b: 1, c: 1)
  (a: 1, b: 1, c: 1)

e's conditional FP-tree:
  root
    b: 3
      c: 2

  Header table (item, head of node-link): b, c

Drawback!
• both e's conditional pattern base and its conditional FP-tree are stored in memory
• e's conditional FP-tree is mined recursively
• conditional pattern bases and FP-trees are built for all other items and their super-patterns
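To make the drawback concrete, the sketch below materializes e's conditional pattern base from the tree, which is the extra structure bottom-up FP-growth has to build for every item. The names are hypothetical and the item order a, b, c, e is taken from the slides.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(transactions, order):
    """Insert each transaction along the global item order; return the header table."""
    root = Node(None, None)
    header = {item: [] for item in order}  # item -> nodes on its side-link
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return header


def conditional_pattern_base(header, item):
    """For every node of `item`, collect its prefix path and the node's count."""
    base = []
    for node in header[item]:
        path, anc = [], node.parent
        while anc.item is not None:  # stop at the root
            path.append(anc.item)
            anc = anc.parent
        if path:
            base.append((tuple(reversed(path)), node.count))
    return base


transactions = [{"b", "e"}, {"a", "b", "c", "e"}, {"b", "c", "e"}, {"a", "c", "d"}, {"a"}]
header = build_tree(transactions, order=["a", "b", "c", "e"])
base_e = conditional_pattern_base(header, "e")
print(base_e)
```

Each entry of `base_e` is a separate copy of a prefix path; FP-growth would next build a conditional FP-tree from these copies and recurse on it.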

Page 10: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

• TD-FP-Growth: adopt a top-down mining strategy
  – motivation: avoid building the extra databases and sub-trees that FP-growth builds
  – method: process nodes on the upper levels before those on the lower levels
  – result: any modification made to upper-level nodes does not affect lower-level nodes

See example

Page 11: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

Transactions (minsup = 2):
  b, e
  a, b, c, e
  b, c, e
  a, c, d
  a

Header Table H

  Entry value   count   side-link
  a             3
  b             3
  c             3
  e             3

[Figure: CT-tree and header table H. Tree nodes: root; a: 3; b: 2; b: 1; c: 1 (three nodes); e: 1 (three nodes).]

Mining order: a, b, c, e

Page 12: Top Down FP-Growth for Association Rule Mining

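CT-tree for frequent pattern mining

Transactions (minsup = 2):
  b, e
  a, b, c, e
  b, c, e
  a, c, d
  a

Header Table H

  Entry value   count   side-link
  a             3
  b             3
  c             3
  e             3

sub-header-table H_c

  Entry value   count   side-link
  a             2
  b             2

[Figure: CT-tree and header table H, with node counts shown while sub-header-table H_c is built. The individual node labels are not fully recoverable from the extracted text.]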

Page 13: Top Down FP-Growth for Association Rule Mining

CT-tree for frequent pattern mining

• Completeness
  – for each entry i in H, we mine all the frequent patterns that end with item i, no more and no less
• Complete set of frequent patterns:
  {a}
  {b}
  {c}, {b, c}, {a, c}
  {e}, {b, e}, {c, e}, {b, c, e}
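A simplified top-down miner can reproduce this set. The sketch below is an assumption-laden illustration rather than the author's exact procedure: the slides describe updating count information on the tree while walking up the paths, whereas this version leaves the tree untouched and keeps the pattern-specific counts inside each sub-header table. All names are hypothetical.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_header(transactions, order):
    """Build the tree and return header table H as item -> {node: count}."""
    root = Node(None, None)
    header = {item: {} for item in order}
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
            header[item][node] = node.count
    return header


def mine_top_down(header, order, minsup, suffix=(), out=None):
    """For each entry i (upper-level items first), emit the patterns ending with i."""
    out = {} if out is None else out
    for item in order:
        nodes = header.get(item, {})
        total = sum(nodes.values())
        if total < minsup:
            continue
        pattern = (item,) + suffix
        out[frozenset(pattern)] = total
        # Build the sub-header table by walking up from every node of `item`.
        sub = defaultdict(lambda: defaultdict(int))
        for node, cnt in nodes.items():
            anc = node.parent
            while anc.item is not None:  # stop at the root
                sub[anc.item][anc] += cnt
                anc = anc.parent
        mine_top_down(sub, order, minsup, pattern, out)
    return out


order = ["a", "b", "c", "e"]
transactions = [{"b", "e"}, {"a", "b", "c", "e"}, {"b", "c", "e"}, {"a", "c", "d"}, {"a"}]
patterns = mine_top_down(build_header(transactions, order), order, minsup=2)
for p, n in sorted(patterns.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(p), n)
```

No conditional pattern base or conditional tree is built; the only extra structures are the sub-header tables, for example H_c with a: 2 and b: 2.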

Page 14: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for frequent pattern mining

• Compared to FP-growth, TD-FP-Growth is:
  – Space saving:
    • only one tree and a few header tables
    • no extra databases and sub-trees
  – Time saving:
    • does not build extra databases and sub-trees
    • walks up each path only once to update the count information for nodes on the tree and to build sub-header-tables

Page 15: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for association rule mining

• Assumptions:
  – There is a class-attribute in the database
  – Items in the class-attribute are called class-items; the others are non-class-items
  – Each transaction is associated with a class-item
  – Only a class-item appears on the right-hand side of a rule

  Transaction ID   non-class-attribute   class-attribute
  1                a, b…                 C1
  2                d…                    C2
  3                e, d, f…              C3
  …                …                     …

example rule: a, b → Ci

Page 16: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for association rule mining -- multiple minimum supports

• Why?
  – With a uniform minimum support, the computation of count considers only the number of appearances
  – A uniform minimum support is unfair to items that appear less often but are worth more
    • e.g. responder vs. non-responder
• How?
  – Use a different support threshold for each class

Page 17: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for association rule mining -- multiple minimum supports

• multiple vs. uniform
  – class counts: C1: 4, C2: 2
  – rules with relative minsup = 50%, proportional to each class (the multiplier in the performance results)
    • uniform minimum support: absolute minsup = 1
      – 11-node tree, 23 rules
    • multiple minimum supports: absolute minsup1 = 2; absolute minsup2 = 1
      – 7-node tree, 9 rules
      – more effective and space-saving
      – time-saving, as shown in the performance results

Transactions:
  c, f, C1
  b, e, C2
  b, e, f, C1
  a, c, f, C1
  c, e, C2
  b, c, d, C1
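The per-class thresholds on this slide follow from the class counts and the relative minimum support. The sketch below shows one way to derive them; the function name is hypothetical, rounding up is an assumption, and taking the smallest per-class value as the uniform threshold is only a reading that happens to reproduce the slide's value of 1.

```python
import math
from collections import Counter


def absolute_minsups(class_labels, relative_minsup):
    """Per-class absolute threshold: relative minsup times the class size.

    Rounding up and the floor of 1 are assumptions of this sketch.
    """
    sizes = Counter(class_labels)
    return {c: max(1, math.ceil(relative_minsup * n)) for c, n in sizes.items()}


# Class-items of the six transactions on this slide.
labels = ["C1", "C2", "C1", "C1", "C2", "C1"]
per_class = absolute_minsups(labels, 0.5)
uniform = min(per_class.values())  # assumed reading of the uniform setting
print(per_class, uniform)
```

With multiple minimum supports, a pattern that occurs with C1 has to reach 2 instead of 1, which is consistent with the smaller tree and rule set reported above.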

Page 18: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for association rule mining -- confidence pruning

• Motivation
  – make use of the other constraint on association rules, confidence, to speed up mining
• Method
  – confidence is not anti-monotone
  – introduce the acting constraint of confidence, which is anti-monotone
  – push it inside the mining process

Page 19: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for association rule mining -- confidence pruning

conf(A → B) = count(AB) / count(A) >= minconf
  ⇒ count(AB) >= count(A) * minconf
  ⇒ count(AB) >= minsup * minconf    (since count(A) >= minsup; anti-monotone & weaker)

--- the acting constraint of confidence for the original confidence constraint of rule A → B

• the support of the rule is computed by count(A)
• count(AB): the class-count of itemset A related to class B

Page 20: Top Down FP-Growth for Association Rule Mining

TD-FP-Growth for association rule mining -- confidence pruning

Transactions (minsup = 2, minconf = 60%):
  c, f, C1
  b, e, C2
  b, e, f, C1
  a, c, f, C1
  a, c, d, C2

Header table H: count(i) = count(i, C1) + count(i, C2)

  Entry value i   count(i)   count(i, C1)   count(i, C2)   side-link
  a               2          1              1              …
  b               2          1              1              …
  c               3          2              1              …
  e               2          1              1              …
  f               3          3              0              …

count(e) >= minsup; however, both count(e, C1) and count(e, C2) < minsup * minconf:
terminate mining for e!

If no confidence pruning: sub-header-table H_e

  Entry value i   count(i)   count(i, C1)   count(i, C2)   side-link
  b               2          1              1
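The check applied to entry e can be written out directly. This is a sketch with hypothetical names: it rebuilds the class counts of the header table from the five transactions and applies the acting constraint count(i, C) >= minsup * minconf to each entry.

```python
from collections import defaultdict


def class_counts(transactions):
    """Header-table counts: item -> {class-item: count(i, C)}."""
    table = defaultdict(lambda: defaultdict(int))
    for items, label in transactions:
        for item in items:
            table[item][label] += 1
    return table


def keep_mining(entry, minsup, minconf):
    """Continue with an entry only if it is frequent and at least one class-count
    passes the acting constraint count(i, C) >= minsup * minconf."""
    total = sum(entry.values())
    return total >= minsup and any(c >= minsup * minconf for c in entry.values())


transactions = [
    ({"c", "f"}, "C1"),
    ({"b", "e"}, "C2"),
    ({"b", "e", "f"}, "C1"),
    ({"a", "c", "f"}, "C1"),
    ({"a", "c", "d"}, "C2"),
]
table = class_counts(transactions)
kept = {i for i, entry in table.items() if keep_mining(entry, minsup=2, minconf=0.6)}
print(sorted(kept))
```

The threshold is 2 * 0.6 = 1.2, so mining stops for e, whose class-counts are both 1. On these numbers the same test also stops a and b, and only c and f are mined further.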

Page 21: Top Down FP-Growth for Association Rule Mining

Performance

• Several data sets were chosen from the UC Irvine Machine Learning Database Repository:
  http://www.ics.uci.edu/~mlearn/MLRepository.html

  name of dataset   # of transactions   # of items in each transaction   class distribution                                   # of distinct items
  Dna-train         2000                61                               23.2%, 24.25%, 52.55%                                240
  Connect-4         67557               43                               9.55%, 24.62%, 65.83%                                126
  Forest            581012              13                               0.47%, 1.63%, 2.99%, 3.53%, 6.15%, 36.36%, 48.76%    15916

Page 22: Top Down FP-Growth for Association Rule Mining

Performance -- frequent pattern

[Figure: results on Forest. x-axis: support threshold, 0% to 10%; y-axis: ticks from 0 to 100 (axis label not recoverable); series: CT-tree, FP-growth, Apriori.]

Page 23: Top Down FP-Growth for Association Rule Mining

Performance -- mine rules with multiple minimum supports

[Figure: multiple sup on Forest (minconf = 90%). x-axis: multiplier, 0% to 5% (relative minsup, proportional to each class); y-axis: logarithmic scale from 10 to 100000 (axis label not recoverable); series: CT-multi-sup, Apri-uni-sup, CT-uni-sup.]

FP-growth is only for frequent pattern mining

Page 24: Top Down FP-Growth for Association Rule Mining

Performance -- mine rules with confidence pruning

[Figure: conf-pruning on Forest (minconf = 90%). x-axis: support threshold, 0.04% to 0.10%; y-axis: ticks from 0 to 300 (axis label not recoverable); series: CT-conf-prune, Apriori, CT-no-conf-prune.]

Page 25: Top Down FP-Growth for Association Rule Mining

Conclusions and future work

• Conclusions about the TD-FP-Growth algorithm
  – more efficient in finding both frequent patterns and association rules
  – more effective in mining rules by using multiple minimum supports
  – introduces a new pruning method, confidence pruning, and pushes it inside the mining process, which further speeds up mining

Page 26: Top Down FP-Growth for Association Rule Mining

Conclusions and future work

• Future work
  – Explore other constraint-based association rule mining methods
  – Mine association rules with an item concept hierarchy
  – Apply TD-FP-Growth to applications based on association rule mining
    • Clustering
    • Classification
