CHAPTER 4 MULTIDIMENSIONAL FREQUENT PATTERN...


CHAPTER 4

MULTIDIMENSIONAL FREQUENT PATTERN MINING

This chapter proposes an efficient algorithm for multidimensional frequent
pattern mining that overcomes four limitations of existing algorithms. First,
the classic frequent pattern mining algorithms (e.g., Apriori and FP-growth)
focus on mining knowledge at a single concept level, either the primitive
level or a rather high concept level. However, it is often desirable to
discover knowledge at multiple concept levels. Second, in real-life
applications, multiple dimensions, such as store location, may be associated
with transactions. Incorporating dimension information into the mining process
can produce patterns with more detailed knowledge. For example, the pattern
(store location: BC, IBM Laptop, HP Epson Color Printer), mined at a support
threshold of 60%, not only informs us of the association between two items,
IBM Laptop and HP Epson Color Printer, but also points out that this
combination frequently occurs in stores located in British Columbia. Third,
previously proposed algorithms for multidimensional frequent pattern mining
adopt an Apriori-like method. It is well known that the Apriori method relies
on iterative pattern generation and multiple database scans, so its efficiency
suffers when long patterns are generated. The FP-growth algorithm (Han et al
2000, 2004) was proposed to mine frequent patterns and has been shown to
achieve better performance than traditional frequent pattern mining
algorithms. Fourth, the classic frequent pattern mining algorithms adopt a
uniform support threshold. Yet in reality, the minimum support is not uniform.
Exceptional items often have either much lower or much higher support than the
general case. Moreover, in most transaction databases, items appear at
different abstraction levels. Thus, a uniform threshold may either generate
uninteresting patterns at high concept levels or miss important patterns at
the primitive level. This chapter proposes a model for mining with various
support constraints and explores a way to extend FP-growth to multidimensional
frequent pattern mining.

In this chapter, the FP-growth algorithm is extended to address the problem of
multidimensional frequent pattern mining. The proposed algorithm inherits the
high scalability of FP-growth. To increase effectiveness, it pushes various
support constraints into the mining process, which makes it more flexible at
capturing the desired knowledge than the existing FP-growth algorithm.

4.1 PROBLEM FORMULATION

A pattern or itemset, P, is one dimension $D_j$, one item $A_k$, or a
conjunction of items and dimensions
$D_i \wedge \dots \wedge D_j \wedge A_k \wedge \dots \wedge A_l$, where
$A_i \wedge \dots \wedge A_j \in \tau$. The support of a pattern P is the ratio
of the number of transactions that contain P to the total number of
transactions. Pattern P is frequent if its support satisfies the minimum
pattern generation threshold $\xi$.

The proposed algorithm addresses the problem of mining multidimensional
frequent patterns. It is able to discover associations between items and
dimensions as well as associations among items, and it improves the
effectiveness of frequent pattern mining by pushing various support
constraints inside the mining process.


4.2 FREQUENT PATTERN MINING

The process of discovering the complete set of frequent patterns is called
“frequent pattern mining”. Its definition is given below.

Definition

Let $t = \{i_1, i_2, \dots, i_m\}$ be a set of items. Let D be a set of
transactions, where each transaction T is a set of items such that
$T \subseteq t$. Patterns are essentially sets of items and are also referred
to as itemsets; the two terms, “itemsets” and “patterns”, are used
interchangeably. An itemset that contains k items is a k-itemset. The
occurrence of an itemset is the number of transactions that contain the
itemset. This is also known as the frequency or support count of the itemset.
The task of frequent pattern mining is to generate all patterns (or itemsets)
whose occurrences (or support) are greater than or equal to the user-specified
minimum support. Researchers have been seeking efficient solutions to the
problem of frequent pattern mining since 1993.
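The definition above can be made concrete with a minimal brute-force sketch in Python. It is not the method used in this chapter; it simply counts, for every candidate itemset, the transactions that contain it. The transactions are those of Table 4.2 in Section 4.4, and the function names `support_count` and `frequent_itemsets` are illustrative assumptions.

```python
from itertools import combinations

# Transactions of Table 4.2 (transaction ID -> set of items).
TDB = {
    10: {"a", "c", "d", "e", "f"},
    20: {"a", "b", "e"},
    30: {"c", "e", "f"},
    40: {"a", "c", "d", "f"},
    50: {"c", "e", "f"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)


def frequent_itemsets(transactions, min_sup):
    """Brute force: test every candidate k-itemset against min_sup."""
    universe = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(universe) + 1):
        for candidate in combinations(universe, k):
            count = support_count(candidate, transactions)
            if count >= min_sup:
                result[frozenset(candidate)] = count
    return result


patterns = frequent_itemsets(TDB, min_sup=2)
print(len(patterns), "frequent itemsets")
```

Enumerating every candidate is exponential in the number of items, which is why the algorithms discussed in the rest of this chapter avoid it.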

4.3 MULTI-DIMENSIONAL FREQUENT PATTERN MINING

Real transaction databases usually contain both item information and
dimension information. Moreover, taxonomies of items are likely to exist. This
chapter explores the problem of multidimensional frequent pattern mining; an
example database is shown in Table 4.1.


Table 4.1 An AllElectronics Database Illustration

Store Location   Trans-ID   List of Item IDs
BC               001        (TV, Color TV, Sony Color TV);
                            (Computer, Laptop, IBM Laptop);
                            (Printer, Color Printer, HP Epson Color Printer)
ON               001        (Printer, Color Printer, HP Epson Color Printer)
BC               002        (TV, Color TV, Sony Color TV);
                            (Computer, Laptop, IBM Laptop)
ON               002        (Computer, Laptop, IBM Laptop)
BC               003        (TV, Color TV, Sony Color TV);
                            (Computer, Laptop, IBM Laptop)
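As a hedged illustration of how dimension information can sit alongside items, the sketch below encodes each row of Table 4.1 as one set in which the store location is folded in as a tagged pseudo-item. This encoding, and the names `to_transaction` and `support`, are assumptions made for the example, not the representation prescribed by the chapter.

```python
# Rows of Table 4.1: (store location, transaction ID, items at every taxonomy level).
ROWS = [
    ("BC", "001", {"TV", "Color TV", "Sony Color TV",
                   "Computer", "Laptop", "IBM Laptop",
                   "Printer", "Color Printer", "HP Epson Color Printer"}),
    ("ON", "001", {"Printer", "Color Printer", "HP Epson Color Printer"}),
    ("BC", "002", {"TV", "Color TV", "Sony Color TV",
                   "Computer", "Laptop", "IBM Laptop"}),
    ("ON", "002", {"Computer", "Laptop", "IBM Laptop"}),
    ("BC", "003", {"TV", "Color TV", "Sony Color TV",
                   "Computer", "Laptop", "IBM Laptop"}),
]


def to_transaction(location, items):
    """Fold the dimension value into the item set as a tagged pseudo-item."""
    return frozenset(items) | {("store location", location)}


TRANSACTIONS = [to_transaction(loc, items) for loc, _tid, items in ROWS]


def support(pattern, transactions):
    """Fraction of transactions that contain every element of `pattern`."""
    pattern = frozenset(pattern)
    return sum(1 for t in transactions if pattern <= t) / len(transactions)


# A multidimensional pattern: one dimension value plus two primitive-level items.
pattern = {("store location", "BC"), "Sony Color TV", "IBM Laptop"}
print(support(pattern, TRANSACTIONS))  # 3 of the 5 rows contain the pattern
```

Because ancestors such as "Laptop" and "Computer" are stored with each primitive item, the same `support` function also answers queries at higher concept levels.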

4.4 STUDY AND IMPLEMENTATION OF FP-GROWTH ALGORITHM

The bottleneck of the Apriori method lies in candidate set generation and
testing. The FP-growth algorithm of Han et al (2004) is reported to be faster
than the Apriori algorithm. Its high efficiency rests on the following three
distinct features.

First, an extended prefix tree structure, called a frequent pattern tree or
FP-tree for short, is used to compress the relevant database information. Only
frequent length-1 items have nodes in the tree, and the tree nodes are
arranged in such a way that more frequently occurring nodes have better
chances of sharing than less frequently occurring nodes.

Second, an FP-tree-based pattern fragment growth mining method, FP-growth, is
developed. Starting from a frequent length-1 pattern (as an initial suffix
pattern), FP-growth examines only its conditional pattern base (a
“sub-database” which consists of the set of frequent items co-occurring with
the suffix pattern), constructs its conditional FP-tree and performs mining
recursively on that tree. Pattern growth is achieved by concatenating the
suffix pattern with the new patterns generated from a conditional FP-tree.
Since the frequent items in any transaction are always encoded in the
corresponding path of the FP-tree, pattern growth ensures the completeness of
the result. In this sense, FP-growth is not an Apriori-like restricted
generation-and-test method but a restricted test-only method. The major
operations of mining are count accumulation and prefix path count adjustment,
which are usually much less costly than the candidate generation and pattern
matching operations performed in most Apriori-like algorithms.

Third, the search technique employed in mining is a partition-based,
divide-and-conquer method rather than an Apriori-like bottom-up generation of
frequent pattern combinations. This dramatically reduces the size of the
conditional pattern base generated at each subsequent level of search, as well
as the size of its corresponding conditional FP-tree. In essence, it
transforms the problem of finding long frequent patterns into looking for
shorter ones and concatenating them with the suffix.

FP-growth generates all frequent patterns with only two database scans. The
first scan finds the frequent 1-itemsets and the second constructs the
FP-tree. The remaining operation is to recursively mine the FP-tree using
FP-growth. Because the FP-tree resides in main memory, FP-growth avoids
further costly database scans.

To illustrate the FP-tree data structure and the FP-growth algorithm, consider
the following example, in which FP-growth is performed on the transaction
database TDB of Table 4.2 with the absolute minimum support (min_sup) set
to 2.

In Table 4.2, <40, {a, c, d, f}> is a transaction in which 40 is the
transaction identifier and {a, c, d, f} is a set of items. {a, c, d, f} can
also be denoted as acdf.

Table 4.2 A Transaction Database TDB

Transaction ID   Items in Transaction
10               a, c, d, e, f
20               a, b, e
30               c, e, f
40               a, c, d, f
50               c, e, f

Step 1:

Scan the transaction database TDB once, collect the count for each item, and
eliminate those items whose support does not pass the specified support
threshold.

Step 1 yields {a} : 3; {b} : 1; {c} : 4; {d} : 2; {e} : 4; {f} : 4. Since
min_sup is 2, the list of frequent 1-itemsets is {a} : 3; {c} : 4; {d} : 2;
{e} : 4; {f} : 4.
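Step 1 amounts to a single counting pass. A minimal Python sketch, assuming the database fits in a dictionary (the variable names are illustrative), is:

```python
from collections import Counter

# Transactions of Table 4.2.
TDB = {
    10: ["a", "c", "d", "e", "f"],
    20: ["a", "b", "e"],
    30: ["c", "e", "f"],
    40: ["a", "c", "d", "f"],
    50: ["c", "e", "f"],
}
MIN_SUP = 2  # absolute minimum support

# One scan: count in how many transactions each item occurs.
counts = Counter(item for items in TDB.values() for item in items)

# Keep only the items whose count reaches the threshold.
frequent = {item: n for item, n in counts.items() if n >= MIN_SUP}

print(sorted(counts.items()))
print(sorted(frequent.items()))
```

The only item removed is b, whose count of 1 is below min_sup.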

Step 2:

Scan the transaction database TDB a second time. For each transaction, filter
out the infrequent items and sort the remaining ones in descending order of
frequency, as shown in Table 4.3. Insert each ordered transaction into the
FP-tree as a branch (as shown in Figure 4.1).

Table 4.3 Transaction Database in Ordered Frequent Items

Transaction ID   Items in Transaction   (Ordered) Frequent Items
10               a, c, d, e, f          f, e, c, a, d
20               a, b, e                e, a
30               c, e, f                f, e, c
40               a, c, d, f             f, c, a, d
50               c, e, f                f, e, c

Figure 4.1 FP-Tree
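Step 2 can be sketched in Python as follows. The sketch reorders each transaction as in Table 4.3 and inserts it into a prefix tree, then prints the tree as indented text. Two points are assumptions of this example rather than statements from the source: items with equal counts (c, e and f) are ordered reverse-alphabetically, chosen only because it reproduces Table 4.3, and the `Node` class is a simplified structure without node-links.

```python
from collections import Counter

TDB = {
    10: ["a", "c", "d", "e", "f"],
    20: ["a", "b", "e"],
    30: ["c", "e", "f"],
    40: ["a", "c", "d", "f"],
    50: ["c", "e", "f"],
}
MIN_SUP = 2
counts = Counter(item for items in TDB.values() for item in items)


def ordered_frequent(items):
    """Drop infrequent items, then sort by descending count.

    Ties are broken reverse-alphabetically (an assumption made to match
    Table 4.3); the stable second sort preserves that tie order.
    """
    kept = [i for i in items if counts[i] >= MIN_SUP]
    return sorted(sorted(kept, reverse=True), key=lambda i: -counts[i])


class Node:
    """Simplified FP-tree node: a name, a count and child nodes."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.children = {}


def build_tree(transactions):
    root = Node(None)  # the "null" root
    for items in transactions:
        node = root
        for item in ordered_frequent(items):
            node = node.children.setdefault(item, Node(item))
            node.count += 1  # shared prefixes accumulate counts
    return root


def render(node, depth=0):
    """Return the tree as a list of indented 'name:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.name}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines


root = build_tree(TDB.values())
print("\n".join(render(root)))
```

Transactions 10, 30 and 50 share the prefix f, e, c, so they occupy a single path whose counts are incremented rather than three separate branches.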


Step 3:

Use FP-growth to recursively mine the FP-tree. The FP-tree is mined from
bottom to top: starting from d, construct the conditional pattern base of each
frequent 1-item. The conditional pattern base of an item/itemset contains the
prefix paths of the transactions that end with that item/itemset. Treat the
conditional pattern base as a transaction database and build the conditional
FP-tree from it (Figure 4.2). The FP-growth algorithm is then performed
recursively on such conditional FP-trees. Item d's conditional pattern base is
{(f:1, e:1, c:1, a:1), (f:1, c:1, a:1)}. In this conditional pattern base, e
occurs only once and is eliminated. The conditional FP-tree is constructed as
below.

Figure 4.2 Conditional FP-Tree

There is only one branch in d's conditional FP-tree. The possible combinations
are: (f,c,a,d : 2), (c,a,d : 2), (f,a,d : 2), (a,d : 2), (f,c,d : 2),
(c,d : 2), (f,d : 2).

Item a's conditional pattern base is {(f:1, e:1, c:1), (e:1), (f:1, c:1)}.
Likewise, construct a's conditional FP-tree and generate the frequent
patterns: (f,c,a : 2), (c,a : 2), (e,a : 2), (f,a : 2).

Item c's conditional pattern base is {(f:3, e:3), (f:1)}. From c's conditional
FP-tree, the frequent patterns generated are: (f,e,c : 3), (e,c : 3),
(f,c : 4). The pattern (f,e,c) has support 3 because f, e and c occur together
only in transactions 10, 30 and 50.

Item e's conditional pattern base is {(f:3)}, and the frequent pattern in e's
conditional FP-tree is (f,e : 3).

Combining these with the frequent 1-items generated during the first database
scan gives the complete set of frequent patterns: (f,c,a,d : 2), (c,a,d : 2),
(f,a,d : 2), (a,d : 2), (f,c,d : 2), (c,d : 2), (f,d : 2), (d : 2),
(f,c,a : 2), (c,a : 2), (e,a : 2), (f,a : 2), (a : 3), (f,e,c : 3), (e,c : 3),
(f,c : 4), (c : 4), (f,e : 3), (e : 4), (f : 4).
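The recursion of Step 3 can be sketched compactly in Python. This is a simplified variant, stated as an assumption: it skips the tree compression and works directly on lists of prefix paths (the conditional pattern bases), which is enough to reproduce the pattern set of this example. The function name `pattern_growth` is illustrative.

```python
from collections import Counter

# Ordered frequent items of Table 4.3, each path carrying count 1.
ORDERED = [
    (("f", "e", "c", "a", "d"), 1),
    (("e", "a"), 1),
    (("f", "e", "c"), 1),
    (("f", "c", "a", "d"), 1),
    (("f", "e", "c"), 1),
]
MIN_SUP = 2


def pattern_growth(base, suffix, min_sup, out):
    """Mine `base`, a list of (prefix path, count), for patterns ending in `suffix`."""
    counts = Counter()
    for path, n in base:
        for item in path:
            counts[item] += n
    for item, total in counts.items():
        if total < min_sup:
            continue  # locally infrequent, e.g. e in d's pattern base
        pattern = (item,) + suffix
        out[frozenset(pattern)] = total
        # Conditional pattern base of `item`: what precedes it on each path.
        cond = [(path[:path.index(item)], n) for path, n in base if item in path]
        cond = [(p, n) for p, n in cond if p]
        pattern_growth(cond, pattern, min_sup, out)
    return out


patterns = pattern_growth(ORDERED, (), MIN_SUP, {})
for itemset, count in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

Each pattern is generated exactly once because an item is only ever extended with items that precede it in the global frequency order.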

With FP-growth, only two database scans are needed. When mining frequent
patterns on large databases or when mining long frequent patterns, FP-growth
outperforms Apriori significantly. When only a small portion of the candidate
sets survives to become frequent patterns, Apriori performs drastically worse
because of its costly candidate generation. This section has discussed how to
generate the complete set of frequent patterns from large databases more
efficiently. In the next section, the FP-growth algorithm is extended to
address the problem of multidimensional frequent pattern mining and a new
algorithm is proposed.

4.5 PROPOSED ALGORITHM

The proposed algorithm is a three-step process. In the first step, the
database is scanned once to get the count of every single item and every
single dimension. The frequent 1-items and frequent 1-dimensions are those
whose counts pass their corresponding support thresholds. In the second step,
the database is scanned again to construct an FP-tree. Here, items or
dimensions can appear in the FP-tree as long as their counts pass their
corresponding passage thresholds. Thus, a frequent pattern which includes the
whole taxonomy information about an item is also interesting to the user.
Finally, all frequent patterns are generated by recursively mining the
FP-tree.

4.5.1 To Find Frequent 1-Items and Frequent 1-Dimensions

In this step, scan the transaction database D once. During this scan, collect
the count for each dimension and item. Meanwhile, compare each count with the
corresponding two types of thresholds: the passage threshold and the printing
threshold. For each individual dimension, compare its count with the dimension
passage threshold and the dimension printing threshold, and eliminate
dimensions whose support does not even pass the corresponding passage
threshold. For each individual item, first detect the abstraction level at
which the item resides and check whether it is a normal item or an exceptional
item. If it is a normal item, compare the item support with the corresponding
passage threshold and printing threshold. Otherwise, narrow the classification
further by labeling the item as either very common or very rare, and compare
the item support with the corresponding passage threshold and printing
threshold. Under all circumstances, the item is not printed as a frequent
1-item unless its support passes the corresponding item printing threshold.
All items whose support passes the corresponding item passage threshold may
appear in frequent patterns of length 2 or longer. Items whose support does
not pass the corresponding passage threshold are discarded.
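The two-threshold test can be sketched as below. All threshold values and the names `Thresholds`, `ITEM_THRESHOLDS` and `classify` are hypothetical, since the chapter does not fix them; the sketch also assumes the passage threshold never exceeds the printing threshold and, for brevity, keys the thresholds by item kind only (per-level thresholds could be keyed analogously).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    passage: int   # minimum count to stay in the mining process
    printing: int  # minimum count to be reported as frequent


# Hypothetical values keyed by item kind; not taken from the source.
ITEM_THRESHOLDS = {
    "normal": Thresholds(passage=2, printing=3),
    "very common": Thresholds(passage=4, printing=5),
    "very rare": Thresholds(passage=1, printing=1),
}


def classify(count, thresholds):
    """Return 'dropped', 'passed' (kept but not printed) or 'printed'."""
    if count < thresholds.passage:
        return "dropped"
    if count >= thresholds.printing:
        return "printed"
    return "passed"


for kind, count in [("normal", 1), ("normal", 2), ("normal", 3), ("very rare", 1)]:
    print(kind, count, classify(count, ITEM_THRESHOLDS[kind]))
```

An item classified as "passed" is not reported as a frequent 1-item but remains available for building longer patterns, which is the role the text assigns to the passage threshold.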


4.5.2 To Construct an FP-tree for the given transaction database

The proposed algorithm adopts the same prefix-tree structure as FP-growth.
The structure of the FP-tree is defined as follows. It consists of one root
labeled “null”, a set of item prefix subtrees as the children of the root, and
a frequent item header table.

• Each node in the item prefix subtree consists of three fields: name, count
  and node-link. Name registers the item (or dimension) represented by the
  node; count registers the number of transactions represented by the portion
  of the path reaching this node; node-link links to the next node in the
  FP-tree carrying the same item name (or dimension name). Node-link is null
  if there is none.

• Each entry in the frequent item header table consists of two fields:
  (1) name and (2) head of node-link. Name represents the item name or the
  dimension name. Head of node-link points to the first node in the FP-tree
  carrying the item name (or dimension name).

The procedure of constructing an FP-tree is illustrated in Figure 4.3. It is a
two-step process.

• Scan the transaction database D once. Collect the set of frequent items (or
  dimensions) F and their supports. Create the root of an FP-tree, T, and
  label it “null”.
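The node and header-table fields listed above can be sketched in Python as follows. The class and method names are assumptions for illustration. The `parent` and `children` fields are not among the three fields named in the text; they are added here so the tree can be built top-down and prefix paths can be followed upward.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class FPNode:
    name: Optional[str]                      # item or dimension name; None for the root
    count: int = 0                           # transactions sharing the path to this node
    parent: Optional["FPNode"] = None        # assumption: not listed in the text
    node_link: Optional["FPNode"] = None     # next node carrying the same name
    children: Dict[str, "FPNode"] = field(default_factory=dict)


class FPTree:
    def __init__(self):
        self.root = FPNode(None)             # the "null" root
        self.header: Dict[str, FPNode] = {}  # name -> head of node-link chain
        self._tail: Dict[str, FPNode] = {}   # name -> last node in the chain

    def insert(self, ordered_items, count=1):
        """Insert one ordered transaction as a branch, sharing existing prefixes."""
        node = self.root
        for name in ordered_items:
            child = node.children.get(name)
            if child is None:
                child = FPNode(name, parent=node)
                node.children[name] = child
                if name in self._tail:
                    self._tail[name].node_link = child  # extend the chain
                else:
                    self.header[name] = child           # first node for this name
                self._tail[name] = child
            child.count += count
            node = child

    def nodes(self, name):
        """Follow the node-links from the header entry for `name`."""
        node = self.header.get(name)
        while node is not None:
            yield node
            node = node.node_link


tree = FPTree()
for items in (["f", "e", "c", "a", "d"], ["e", "a"], ["f", "e", "c"],
              ["f", "c", "a", "d"], ["f", "e", "c"]):
    tree.insert(items)
print([n.count for n in tree.nodes("c")])
```

The usage lines insert the ordered transactions of Table 4.3; following the node-links for an item visits every branch on which it occurs, which is what the conditional pattern base construction relies on.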


Figure 4.3 FP-tree for transaction database of table 4.1

4.5.3 Recursively Mine FP-tree to Generate Multi-dimensional Frequent Patterns

Sections 4.5.1 and 4.5.2 gather the counts for each individual item and
dimension and compress the complete information about the transaction
database into the FP-tree, at the cost of two database scans. From this point
on, the pattern generation process proceeds by recursively visiting the
FP-tree; no more costly database operations are involved.

Starting from the least frequent item/dimension, generate the conditional
pattern base and accordingly construct the conditional FP-tree for each member
of the header table. The conditional pattern base for an item i (or a
dimension d) consists of the items/dimensions that co-occur with the item i
(or the dimension d). In the same manner, explore the conditional FP-tree
using the proposed algorithm.

The core of the proposed algorithm is the FP-growth operation; the critical
addition is the pushing of various support constraints. The proposed algorithm
for frequent pattern generation is given below.


Algorithm for Frequent pattern Generation

The inputs are: (1) the FP-tree constructed in step 2, (2) length-k passage
thresholds (k ≥ 2), (3) length-k printing thresholds (k ≥ 2), (4) item passage
threshold for special items, (5) item printing threshold for special items.

Call extended-FP-growth(FP-tree, null)

Procedure extended-FP-growth(Tree, Y) {
    if (Tree contains a single path P) then {
        for (each combination, denoted as X, of the nodes in the path P) do {
            generate pattern X∪Y with support = minimum support of nodes in X;
            if (pattern X∪Y contains special items) then {
                if (pattern X∪Y's support is larger than or equal to the
                    special item's passage threshold) then {
                    if (pattern X∪Y's support is larger than or equal to the
                        special item's printing threshold) then {
                        add X∪Y to L;
                        add X∪Y to C;
                    }
                }
            }
            else {
                len = the length of pattern X∪Y;
                if (pattern X∪Y's support is larger than or equal to the
                    length-len passage threshold) then {
                    if (pattern X∪Y's support is larger than or equal to the
                        length-len printing threshold) then {
                        add X∪Y to L;
                        add X∪Y to C;
                    }
                }
            }
        }
    }
    else for (each ai in the header of Tree) do {
        generate pattern X = ai∪Y with support = ai.support;
        construct X's conditional pattern base and then X's conditional
            FP-tree Treep;
        if (Treep ≠ ∅) then
            call extended-FP-growth(Treep, X);
    }
}
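The single-path branch of the procedure can be sketched in Python as below. This is an interpretation under stated assumptions, not the author's implementation: the special-item branch is omitted, the threshold tables are hypothetical, and instead of the two collections L and C the function returns, for each pattern that meets its passage threshold, the support together with a flag saying whether the printing threshold is also met.

```python
from itertools import combinations


def single_path_patterns(path, suffix, passage, printing):
    """Enumerate X ∪ Y for every non-empty combination X of nodes on one path.

    path     -- [(name, count), ...] from the root side to the leaf
    suffix   -- tuple of names already fixed (Y in the pseudocode)
    passage  -- {pattern length: passage threshold}   (hypothetical values)
    printing -- {pattern length: printing threshold}  (hypothetical values)
    Returns {pattern: (support, printed)} for patterns meeting the passage threshold.
    """
    out = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            names = tuple(name for name, _ in combo) + suffix
            support = min(count for _, count in combo)  # minimum count of nodes in X
            length = len(names)
            if support >= passage[length]:
                out[names] = (support, support >= printing[length])
    return out


# d's conditional FP-tree from Section 4.4 is the single path f:2, c:2, a:2.
result = single_path_patterns(
    path=[("f", 2), ("c", 2), ("a", 2)],
    suffix=("d",),
    passage={2: 2, 3: 2, 4: 2},
    printing={2: 2, 3: 2, 4: 3},  # a stricter printing threshold for length 4
)
for pattern, (support, printed) in result.items():
    print(pattern, support, "printed" if printed else "passed only")
```

With these illustrative thresholds all seven combinations pass, but the length-4 pattern is retained without being printed, which shows how length-dependent thresholds change what is reported.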


4.6 DATASET DESCRIPTIONS

The IBM synthetic data generator
(http://www.almaden.ibm.com/software/quest/Resources/index.shtml) is
integrated into this work. This generator can produce datasets with various
kinds of data distributions. A generated dataset is named in the form
“T10I4D100K”, where “T” denotes the average number of items per transaction,
“I” the average length of maximal patterns, and “D” the total number of
transactions in the dataset. By default, the number of items in the dataset is
one tenth of the number of transactions. Therefore, “T10I4D100K” refers to a
dataset with the following characteristics: the average number of items per
transaction is 10; the average length of maximal patterns is 4; the total
number of transactions is 100,000; and the total number of items is 10,000. A
further dataset can be downloaded from the UC-Irvine Machine Learning Database
Repository. Its total number of transactions is 67,557, and each transaction
contains 43 items. It is a dense dataset with many long frequent itemsets.
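The naming convention can be decoded mechanically. The following Python sketch is a hypothetical helper, not part of the IBM generator: it splits a dataset name into its letter-number pairs and treats a trailing K as a factor of 1,000.

```python
import re


def parse_dataset_name(name):
    """Split a name such as 'T10I4D100K' into its letter/number parameters."""
    params = {}
    for letter, digits, kilo in re.findall(r"([A-Z])(\d+)(K?)", name):
        params[letter] = int(digits) * (1000 if kilo else 1)
    return params


def default_item_count(params):
    """Number of items under the stated default: one tenth of the transactions."""
    return params["D"] // 10


params = parse_dataset_name("T10I4D100K")
print(params, default_item_count(params))
```

Any additional letter in a name is parsed the same way, but its meaning has to come from the generator's documentation rather than from this helper.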

4.7 EXPERIMENTAL EVALUATION AND RESULTS

Here, the proposed algorithm and the FP-growth algorithm are tested on two
datasets (T10I4D100K and T25I20D100K) under different support thresholds.

Table 4.4 Mining Complete Set of Frequent Patterns on T10I4D100K

                     Runtime (in secs) at Different Supports (%)
Algorithms           0.05   0.1    0.5    1      5      10     15
FP-Growth            463    446    154    36     57     8      1
Proposed Algorithm   149    66     20     8      3      1      1

[Line chart: runtime (in secs) versus support (in %) for FP-Growth and the Proposed Algorithm]

Figure 4.4 Mining Complete Set of Frequent Patterns on T10I4D100K

Table 4.5 Mining Complete Set of Frequent Patterns on T25I20D100K

                     Runtime (in secs) at Different Supports (%)
Algorithms           20     25     30     35     40     45     50
FP-Growth            6430   3450   1455   343    257    98     10
Proposed Algorithm   1940   756    459    184    38     21     5

[Line chart: runtime (in secs) versus support (in %) for FP-Growth and the Proposed Algorithm]

Figure 4.5 Mining Complete Set of Frequent Patterns on T25I20D100K


Tables 4.4 and 4.5, plotted in Figures 4.4 and 4.5 respectively, show how the
runtimes of the proposed algorithm and the FP-growth algorithm scale with the
support threshold on two different datasets. From the graphs, one can conclude
that the proposed method takes less time than FP-growth to generate the
complete set of frequent patterns at every support threshold except 15% on
T10I4D100K, where both take 1 second. The high efficiency of the proposed
algorithm is especially apparent on dense datasets. To conclude, of the two
algorithms, the proposed algorithm is the more efficient for mining the
complete set of frequent patterns.
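The runtime comparison can be summarized as speedup ratios computed directly from Tables 4.4 and 4.5. The short Python sketch below only performs that arithmetic; the variable and function names are illustrative.

```python
# Runtimes in seconds, copied from Tables 4.4 and 4.5.
SUPPORTS_T10 = [0.05, 0.1, 0.5, 1, 5, 10, 15]
FP_T10 = [463, 446, 154, 36, 57, 8, 1]
PROPOSED_T10 = [149, 66, 20, 8, 3, 1, 1]

SUPPORTS_T25 = [20, 25, 30, 35, 40, 45, 50]
FP_T25 = [6430, 3450, 1455, 343, 257, 98, 10]
PROPOSED_T25 = [1940, 756, 459, 184, 38, 21, 5]


def speedups(baseline, proposed):
    """Ratio of baseline runtime to proposed runtime at each support level."""
    return [round(b / p, 2) for b, p in zip(baseline, proposed)]


for label, supports, fp, prop in [
    ("T10I4D100K", SUPPORTS_T10, FP_T10, PROPOSED_T10),
    ("T25I20D100K", SUPPORTS_T25, FP_T25, PROPOSED_T25),
]:
    print(label, dict(zip(supports, speedups(fp, prop))))
```

A ratio above 1 means the proposed algorithm is faster at that support level; a ratio of exactly 1 means the two runtimes are equal.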

A uniform support constraint is not suitable for the task, so the proposed
algorithm introduces flexible support constraints. Different concept levels
should have different support thresholds. Exceptional items (either common,
such as milk, or rare, such as diamonds) should be picked out and appraised
against different thresholds, and patterns of different lengths should be
associated with different thresholds as well. Figure 4.6 shows how flexible
support constraints contribute to increasing the effectiveness of
multidimensional frequent pattern mining. It reports the difference between
the number of multidimensional frequent patterns generated when applying
uniform support constraints and when pushing various support constraints. The
experiments are performed on the T25I10M3D10K dataset.

[Chart: number of patterns (K) versus support threshold (%) at 0.6, 0.8, 1, 1.2 and 1.5, for three series: uniform low-level support threshold, various support thresholds, and uniform high-level support threshold]

Figure 4.6 Effectiveness of Flexible Support Constraints


All the experiments are performed on a 2.20 GHz Intel® Core 2 Duo laptop with
3.00 GB RAM running Microsoft Windows. All programs are written in Microsoft
Visual C++ 6.0. The proposed method can be used to facilitate decision making
and boost business sales. General frequent pattern mining algorithms focus on
mining at a single level and discover only strong associations between items.
