CHAPTER 4
MULTIDIMENSIONAL FREQUENT PATTERN MINING
In this chapter, an efficient algorithm for multidimensional frequent
pattern mining is proposed that overcomes four limitations of existing
algorithms. First, the classic frequent pattern mining algorithms (e.g., Apriori,
FP-growth) focus on mining knowledge at a single concept level,
i.e., either the primitive level or a rather high concept level. However, it is often
desirable to discover knowledge at multiple concept levels. Second, in real
life applications, multiple dimensions, such as store locations, may be
associated with transactions. Incorporating dimension information into the
mining process can produce patterns with more detailed knowledge. For
example, the pattern (store location: BC, IBM Laptop, HP Epson Color Printer)
with a support threshold of 60% not only informs us of the association
between two items, IBM Laptop and HP Epson Color Printer, but also
points out that such a combination frequently occurred in the stores located in
British Columbia. Third, previously proposed algorithms for
multidimensional frequent pattern mining adopted an Apriori-like method. It
is well known that the Apriori method relies on iterative pattern generation
and multiple database scans. Hence, the efficiency of the Apriori method
might suffer when long patterns are generated. More recently, the
FP-growth algorithm (Han et al 2000, 2004) was proposed to mine frequent
patterns. FP-growth has been reported to achieve better performance than
traditional frequent pattern mining algorithms. Lastly, the classic frequent
pattern mining algorithms adopt a uniform support threshold. Yet in reality,
the minimum support is not uniform. Exceptional items often have either
much lower or much higher support than general cases. Moreover, in most
transaction databases, items appear at different abstraction levels. Thus, a
uniform threshold might either generate uninteresting patterns at high
concept levels or miss important patterns at the primitive level. This chapter
proposes a model for mining with various support constraints and explores a
way to extend FP-growth to multidimensional frequent pattern mining.
In this chapter, the FP-growth algorithm is extended to address the
problem of multidimensional frequent pattern mining. The scalability of the
proposed algorithm is ensured by the high scalability of FP-growth. To increase
effectiveness, the proposed algorithm pushes various support constraints into
the mining process. The proposed algorithm is more flexible at capturing
desired knowledge than the existing FP-growth algorithm.
4.1 PROBLEM FORMULATION
A pattern or itemset, P, is one dimension Dj or one item Ak, or a
set of conjunctive items and dimensions Di ∧ … ∧ Dj ∧ Ak ∧ … ∧ Al, where
Ai ∧ … ∧ Aj ∈ τ. The support of a pattern P is the ratio of the number of transactions that
contain P to the total number of transactions. Pattern P is frequent if its
support satisfies the minimum pattern generation threshold ξ.
The problem of mining multi-dimensional frequent patterns is
attacked by implementing the proposed algorithm. The proposed algorithm is
able to discover associations between items and dimensions as well as
associations among items. The proposed algorithm improves the effectiveness
of frequent pattern mining by pushing various support constraints inside the
mining process.
4.2 FREQUENT PATTERN MINING
The process of discovering the complete set of frequent patterns is
also called “frequent pattern mining”. Its definition is given below.
Definition
Let t = {i1, i2, …, im} be a set of items. Let D be a set of transactions,
where each transaction T is a set of items such that T ⊆ t. A pattern is
essentially a set of items and is also referred to as an itemset; the two terms,
“itemset” and “pattern”, are used interchangeably. An itemset that contains k items is a
k-itemset. The occurrence of an itemset is the number of transactions that
contain the itemset. This is also known as the frequency or support count of the
itemset. The task of frequent pattern mining is to generate all patterns (or
itemsets) whose occurrences (or support) are greater than or equal to the
user-specified minimum support. Researchers have been seeking efficient
solutions to the problem of frequent pattern mining since 1993.
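As a minimal illustration of this definition, the sketch below counts the occurrences of an itemset and tests it against a minimum support. The function names `support_count` and `is_frequent` and the small sample database are hypothetical and chosen only for illustration.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def is_frequent(itemset, transactions, min_sup):
    """True if the support count reaches the absolute minimum support."""
    return support_count(itemset, transactions) >= min_sup


# Hypothetical sample database D (each transaction is a set of items).
D = [{"i1", "i2", "i3"}, {"i1", "i3"}, {"i2"}, {"i1", "i2", "i3"}]

print(support_count({"i1", "i3"}, D))           # 3
print(is_frequent({"i1", "i3"}, D, min_sup=2))  # True
```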
4.3 MULTI-DIMENSIONAL FREQUENT PATTERN MINING
Real transaction databases usually contain both item information
and dimension information. Moreover, taxonomies about items likely exist.
This chapter explores the problem of multi-dimensional frequent pattern
mining. An example database for multidimensional frequent pattern mining is shown
in Table 4.1.
Table 4.1 An AllElectronics Database Illustration
Store Location   Trans-ID   List of Item IDs
BC               001        (TV, Color TV, Sony Color TV); (Computer, Laptop, IBM Laptop); (Printer, Color Printer, HP Epson Color Printer)
ON               001        (Printer, Color Printer, HP Epson Color Printer)
BC               002        (TV, Color TV, Sony Color TV); (Computer, Laptop, IBM Laptop)
ON               002        (Computer, Laptop, IBM Laptop)
BC               003        (TV, Color TV, Sony Color TV); (Computer, Laptop, IBM Laptop)
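To make the role of dimensions and taxonomies concrete, the following sketch encodes the rows of Table 4.1 and computes the support of patterns that mix a dimension value with items at different concept levels. The encoding (a `("loc", value)` tuple for the store-location dimension and one flat set per transaction) and the helper name `support` are assumptions made for illustration, not the representation used by the proposed algorithm.

```python
# Each row of Table 4.1: the dimension value plus every item at every
# taxonomy level (e.g. TV > Color TV > Sony Color TV).
TV = {"TV", "Color TV", "Sony Color TV"}
PC = {"Computer", "Laptop", "IBM Laptop"}
PRINTER = {"Printer", "Color Printer", "HP Epson Color Printer"}

table_4_1 = [
    ("BC", TV | PC | PRINTER),
    ("ON", PRINTER),
    ("BC", TV | PC),
    ("ON", PC),
    ("BC", TV | PC),
]

# Flatten each row into one set; the dimension is tagged so that it
# cannot be confused with an item name.
rows = [{("loc", loc)} | items for loc, items in table_4_1]


def support(pattern, rows):
    """Fraction of rows that contain every element of `pattern`."""
    pattern = set(pattern)
    return sum(1 for r in rows if pattern <= r) / len(rows)


print(support({("loc", "BC"), "IBM Laptop"}, rows))  # 0.6 (3 of 5 rows)
print(support({"Laptop", "Color TV"}, rows))         # 0.6 (3 of 5 rows)
print(support({("loc", "ON"), "Printer"}, rows))     # 0.2 (1 of 5 rows)
```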
4.4 STUDY AND IMPLEMENTATION OF FP-GROWTH
ALGORITHM
It is noticed that the bottleneck of the Apriori method rests on the
candidate set generation and test. An algorithm called FP-growth given by
Han et al (2004) is reported to be faster than the Apriori algorithm. The high
efficiency of FP-growth is achieved in the following three aspects. They form
the distinct features of FP-growth.
First, an extended prefix tree structure, called frequent pattern tree
or FP-tree in short, is used to compress the relevant database information.
Only frequent length-1 items will have nodes in the tree, and the tree nodes
are arranged in such a way that more frequently occurring nodes will have
better chances of sharing than less frequently occurring nodes.
Secondly, an FP-tree-based pattern-fragment growth mining
method, FP-growth, is developed. Starting from a frequent length-1 pattern
(as an initial suffix pattern), FP-growth examines only its conditional pattern
base (a “sub-database” which consists of the set of frequent items
co-occurring with the suffix pattern), constructs its conditional FP-tree and
performs mining recursively on such a tree. The pattern growth is achieved
via concatenation of the suffix pattern with the new ones generated from a
conditional FP-tree. Since the frequent pattern in any transaction is always
encoded in the corresponding path of the frequent pattern trees, pattern
growth ensures the completeness of the result. In this context, FP-growth is
not Apriori-like restricted generation-and-test but restricted test only. The
major operations of mining are count accumulation and prefix path count
adjustment, which are usually much less costly than candidate generation and
pattern matching operations performed in most Apriori-like algorithms.
Thirdly, the search technique employed in mining is a partition-based,
divide-and-conquer method rather than Apriori-like bottom-up
generation of frequent pattern combinations. This dramatically reduces the
size of conditional pattern base generated at the subsequent level of search as
well as the size of its corresponding conditional FP-tree. Inherently, it
transforms the problem of finding long frequent patterns to looking for shorter
ones and concatenating with the suffix.
FP-growth generates all frequent patterns with only two database
scans. The first scan finds the frequent 1-itemsets and the second
constructs the FP-tree. The remaining work is to mine the FP-tree
recursively using FP-growth. Because the FP-tree resides in main memory,
FP-growth avoids further costly database scans.
To illustrate the FP-tree data structure and the FP-growth algorithm,
consider the following example. The FP-growth algorithm is performed on
the transaction database TDB (Table 4.2) with the absolute minimum
support (min_sup) set to 2.
The transaction database TDB is given in Table 4.2. Here <40, {a, c, d, f}>
is a transaction, in which 40 is the transaction identifier and {a, c, d, f}
is a set of items. {a, c, d, f} can also be denoted as acdf.
Table 4.2 A Transaction Database TDB
Transaction ID   Items in Transaction
10               a, c, d, e, f
20               a, b, e
30               c, e, f
40               a, c, d, f
50               c, e, f
Step 1:
Scan the transaction database TDB once, collect the count for each
item, and eliminate those items whose support does not pass the specified
support threshold.
After step 1, the counts are {a} : 3; {b} : 1; {c} : 4; {d} : 2; {e} : 4;
{f} : 4. Since min_sup is 2, the list of frequent 1-itemsets is {a} : 3; {c} : 4;
{d} : 2; {e} : 4; {f} : 4.
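The first scan can be sketched in a few lines of Python. The variable names (`TDB`, `MIN_SUP`, `frequent_1`) are illustrative; the data is taken from Table 4.2.

```python
from collections import Counter

# Transaction database TDB from Table 4.2.
TDB = {
    10: ["a", "c", "d", "e", "f"],
    20: ["a", "b", "e"],
    30: ["c", "e", "f"],
    40: ["a", "c", "d", "f"],
    50: ["c", "e", "f"],
}
MIN_SUP = 2  # absolute minimum support

# One pass over the database: count every item.
counts = Counter(item for items in TDB.values() for item in items)

# Keep only the items whose count reaches the threshold.
frequent_1 = {item: n for item, n in counts.items() if n >= MIN_SUP}

print(sorted(frequent_1.items()))
# [('a', 3), ('c', 4), ('d', 2), ('e', 4), ('f', 4)]
```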
Step 2:
Scan the transaction database TDB a second time. For each
transaction, filter out the infrequent items and sort the remaining ones in
descending order of frequency, as shown in Table 4.3. Insert the resulting
ordered item list into the FP-tree as a branch (as shown in Figure 4.1).
Table 4.3 Transaction Database in Ordered Frequent Items
Transaction ID   Items in transaction   (Ordered) Frequent Items
10               a, c, d, e, f          f, e, c, a, d
20               a, b, e                e, a
30               c, e, f                f, e, c
40               a, c, d, f             f, c, a, d
50               c, e, f                f, e, c
Figure 4.1 FP-Tree
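The second scan can be sketched as follows. The FP-tree is represented here as nested dictionaries, a simplification that omits node-links and the header table. The tie-breaking order f, e, c among the items with count 4 is copied from Table 4.3 rather than derived, and all names are illustrative.

```python
# Item order used in Table 4.3 (descending support; ties as in the table).
ORDER = ["f", "e", "c", "a", "d"]
RANK = {item: i for i, item in enumerate(ORDER)}

TDB = {
    10: ["a", "c", "d", "e", "f"],
    20: ["a", "b", "e"],
    30: ["c", "e", "f"],
    40: ["a", "c", "d", "f"],
    50: ["c", "e", "f"],
}


def ordered_frequent(items):
    """Drop infrequent items and sort the rest by the global order."""
    return sorted((i for i in items if i in RANK), key=RANK.get)


def insert(tree, items):
    """Insert one ordered transaction as a branch, sharing prefixes."""
    for item in items:
        node = tree.setdefault(item, {"count": 0, "children": {}})
        node["count"] += 1
        tree = node["children"]


root = {}  # children of the "null" root
for tid, items in TDB.items():
    insert(root, ordered_frequent(items))

print(ordered_frequent(TDB[10]))                 # ['f', 'e', 'c', 'a', 'd']
print({k: v["count"] for k, v in root.items()})  # {'f': 4, 'e': 1}
```

Four of the five transactions share the prefix f, so the root has only two children; this prefix sharing is what compresses the database.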
Step 3:
Use FP-growth to mine the FP-tree recursively. The FP-growth
algorithm mines the FP-tree from bottom to top. Starting from d,
construct the conditional pattern base of each frequent 1-item. A
conditional pattern base for an item/itemset contains the prefix paths of the
transactions that end with that item/itemset. Treat the conditional pattern base the same as a
transaction database and build the conditional FP-tree (Figure 4.2). The FP-growth
algorithm is recursively performed on such conditional FP-trees. Item d’s
conditional pattern base is: {(f:1, e:1, c:1, a:1), (f:1, c:1, a:1)}. In this
conditional pattern base, e occurs only once and is eliminated. The
conditional FP-tree is constructed as below.
Figure 4.2 Conditional FP-Tree
There is only one branch in d’s conditional FP-tree. The
possible combinations are: (f,c,a,d : 2), (c,a,d : 2), (f,a,d : 2), (a,d : 2),
(f,c,d : 2), (c,d : 2), (f,d : 2).
Item a’s conditional pattern base is: {(f:1, e:1, c:1), (e:1), (f:1,
c:1)}. Likewise, construct a’s conditional FP-tree and generate the frequent
patterns as: (f,c,a : 2), (c,a : 2), (e,a : 2), (f,a : 2).
Item c’s conditional pattern base is: {(f:3, e:3), (f:1)}. Thus
c’s conditional FP-tree consists of the single path (f:4, e:3),
and the frequent patterns generated are: (f,e,c : 3), (e,c : 3), (f,c : 4).
Item e’s conditional pattern base is: {(f:3)} and the frequent
pattern in e’s conditional FP-tree is: (f,e : 3).
Combining these with the frequent 1-items generated during the first
database scan gives the complete set of frequent patterns: (f,c,a,d : 2),
(c,a,d : 2), (f,a,d : 2), (a,d : 2), (f,c,d : 2), (c,d : 2), (f,d : 2), (d : 2), (f,c,a : 2),
(c,a : 2), (e,a : 2), (f,a : 2), (a : 3), (f,e,c : 3), (e,c : 3), (f,c : 4), (c : 4),
(f,e : 3), (e : 4), (f : 4).
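The walkthrough can be checked with a short sketch. It is a simplification under stated assumptions: instead of materializing conditional FP-trees, it keeps each conditional pattern base as a list of (prefix, count) pairs and recurses on those. This yields the same patterns as FP-growth on this example, but it is not the tree-based implementation. The item order is that of Table 4.3, and the names `cond_base` and `grow` are hypothetical.

```python
from collections import Counter

MIN_SUP = 2
# Ordered frequent items per transaction (Table 4.3), each with count 1.
BASE = [(("f", "e", "c", "a", "d"), 1), (("e", "a"), 1),
        (("f", "e", "c"), 1), (("f", "c", "a", "d"), 1),
        (("f", "e", "c"), 1)]


def cond_base(base, item):
    """Prefixes that precede `item`, merged and counted."""
    merged = Counter()
    for items, n in base:
        if item in items:
            prefix = items[:items.index(item)]
            if prefix:
                merged[prefix] += n
    return list(merged.items())


def grow(base, suffix, out):
    """Extend `suffix` by every item that is frequent in `base`."""
    counts = Counter()
    for items, n in base:
        for item in items:
            counts[item] += n
    for item, sup in counts.items():
        if sup >= MIN_SUP:
            pattern = (item,) + suffix
            out[frozenset(pattern)] = sup
            # Recurse on the sub-database that co-occurs with `pattern`.
            grow(cond_base(base, item), pattern, out)


patterns = {}
grow(BASE, (), patterns)

print(cond_base(BASE, "c"))  # [(('f', 'e'), 3), (('f',), 1)]
print(len(patterns))         # 20
print(patterns[frozenset("fec")], patterns[frozenset("fc")])  # 3 4
```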
With FP-growth, only two database scans are needed. When
mining frequent patterns on large databases or when mining longer frequent
patterns, FP-growth outperforms Apriori significantly. When only a small
portion of the candidate sets survives to become frequent patterns, Apriori
performs drastically worse because of the costly candidate generation. This section has discussed
how to generate the complete set of frequent patterns from large databases
more efficiently. In the next section, the FP-growth algorithm is extended to
address the problem of multidimensional frequent pattern mining and a new
algorithm is proposed.
4.5 PROPOSED ALGORITHM
The proposed algorithm is a three-step process. In the first step,
the database is scanned once to get the count of every single item and every
single dimension. The frequent 1-items or frequent 1-dimensions are those
whose counts pass their corresponding support threshold. In the second step,
the database is scanned again to construct an FP-tree. Here items or dimensions
can appear in the FP-tree as long as their counts pass their corresponding
passage threshold. Thus, a frequent pattern which includes the whole
taxonomy information about an item is also interesting to the user. Finally, all
frequent patterns are generated by recursively mining the FP-tree.
4.5.1 To Find Frequent 1-Items and Frequent 1-Dimensions
In this step, scan the transaction database D once. During this
database scan, collect the count for each dimension and item.
Meanwhile, compare their counts with the corresponding two types of
thresholds: the passage threshold and the printing threshold. For each individual
dimension, compare its count with the dimension passage threshold and the
dimension printing threshold. Eliminate dimensions whose support does not
even pass the corresponding passage threshold. For each individual item, first
detect the abstraction level the item resides in and check whether it is a
normal item or an exceptional item. If it is a normal item, compare the item
support with the corresponding passage threshold and printing threshold.
Otherwise, narrow the possibility further, label this item as either a very
common one or a very rare one, and compare the item support with the
corresponding passage threshold and printing threshold. Under all circumstances,
the item will not be printed as a frequent 1-item unless its support passes the
corresponding item printing threshold. All items whose support passes the
corresponding item passage threshold may appear in frequent patterns of
length 2 or longer. The items whose supports do not pass the
corresponding passage threshold are discarded.
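The two-threshold test can be sketched as below. The concrete threshold values, the category labels (`"normal"`, `"common"`, `"rare"`) and the function name `classify` are hypothetical; the text does not give numeric thresholds. The sketch only assumes that each category has a passage threshold no higher than its printing threshold.

```python
# Hypothetical (passage, printing) thresholds per item category, as
# absolute support counts. The values are placeholders for illustration.
THRESHOLDS = {
    "normal": (2, 3),
    "common": (4, 5),  # very common exceptional items
    "rare": (1, 2),    # very rare exceptional items
}


def classify(support, category):
    """Return 'discard', 'pass' or 'print' for one item or dimension.

    'pass'  : may take part in longer patterns but is not reported itself.
    'print' : additionally reported as a frequent 1-item.
    """
    passage, printing = THRESHOLDS[category]
    if support < passage:
        return "discard"
    if support < printing:
        return "pass"
    return "print"


print(classify(2, "normal"))  # pass
print(classify(3, "normal"))  # print
print(classify(3, "common"))  # discard
print(classify(1, "rare"))    # pass
```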
4.5.2 To Construct an FP-tree for the given transaction database
The proposed algorithm adopts the same prefix-tree structure as the
one used by FP-growth. The structure of the FP-tree is defined below. It consists
of one root labeled as “null”, a set of item prefix subtrees as the children of
the root, and a frequent item header table.
• Each node in the item prefix subtree consists of three fields:
name, count and node-link. Name registers the item (or
dimension) represented by the node; count registers the
number of transactions represented by the portion of the path
reaching this node; node-link links to the next node in the
FP-tree carrying the same item-name (or dimension-name).
Node-link is null if there is none.
• Each entry in the frequent item header table consists of two
fields: (1) name and (2) head of node-link. Name represents the
item name or the dimension name. Head of node-link points to
the first node in the FP-tree carrying the item-name (or
dimension-name).
The procedure for constructing an FP-tree is illustrated in Figure 4.3.
It is a two-step process.
• Scan the transaction database D once. Collect the set of
frequent items (or dimensions) F and their supports. Create the
root of an FP-tree, T, and label it as “null”.
• Scan the transaction database D a second time. For each
transaction, select its frequent items (or dimensions), sort them in
descending order of support, and insert them into T as a branch.
Figure 4.3 FP-tree for transaction database of table 4.1
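A minimal sketch of this structure is given below, with the three node fields and a header table that heads each node-link chain. The class and function names are assumptions for illustration (the `parent` and `children` attributes are additions needed for traversal), and the small input of three dimension-tagged transactions, already sorted in a fixed order, is made up rather than taken from Table 4.1.

```python
class Node:
    """FP-tree node with the three fields described above."""

    def __init__(self, name, parent=None):
        self.name = name        # item name or dimension name
        self.count = 0          # transactions sharing the path to this node
        self.node_link = None   # next node carrying the same name, or None
        self.parent = parent
        self.children = {}


def insert(root, header, names):
    """Insert one ordered transaction and maintain the node-links."""
    node = root
    for name in names:
        child = node.children.get(name)
        if child is None:
            child = Node(name, parent=node)
            node.children[name] = child
            # Append the new node to the end of the node-link chain.
            if name not in header:
                header[name] = child
            else:
                last = header[name]
                while last.node_link is not None:
                    last = last.node_link
                last.node_link = child
        child.count += 1
        node = child


def support_via_links(header, name):
    """Sum the counts along the node-link chain of `name`."""
    total, node = 0, header.get(name)
    while node is not None:
        total += node.count
        node = node.node_link
    return total


root, header = Node(None), {}  # root plays the role of "null"
for t in (["loc=BC", "Laptop", "TV"], ["loc=ON", "Laptop"], ["loc=BC", "Laptop"]):
    insert(root, header, t)

print(support_via_links(header, "Laptop"))  # 3
print(root.children["loc=BC"].count)        # 2
```

Following the node-links from the header table visits every node of one item or dimension without walking the whole tree, which is what makes building a conditional pattern base cheap.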
4.5.3 Recursively Mine FP-tree to Generate Multi-dimensional
Frequent Patterns
In 4.5.1 and 4.5.2, the counts for each individual item and
dimension are gathered and the complete information about the transaction
database is compressed into the FP-tree. Both are achieved at the cost of two database scans.
From now on, the pattern generation process proceeds by recursively visiting the
FP-tree. No more costly database operations are involved.
Starting from the least frequent item/dimension, generate the
conditional pattern base and accordingly construct the conditional FP-tree for
each member of the header table. The conditional pattern base for an item i (or a
dimension d) consists of the items/dimensions that co-occur with the item i
(or the dimension d). In the same manner, explore the conditional FP-tree using
the proposed algorithm.
The essence of the proposed algorithm is the FP-growth operation.
The critical part is the pushing of various support constraints. The proposed
algorithm for frequent pattern generation is given below.
Algorithm for Frequent Pattern Generation
The inputs are: (1) the FP-tree constructed in step 2, (2) length k-level passage
thresholds (k ≥ 2), (3) length k-level printing thresholds (k ≥ 2), (4) item
passage threshold for special items, (5) item printing threshold for special items.
Call extended-FP-growth(FP-tree, null)
Procedure extended-FP-growth(Tree, Y) {
  if (Tree contains a single path P) then {
    for (each combination, denoted as X, of the nodes in the path P) do {
      generate pattern X∪Y with support = minimum support of nodes in X;
      if (pattern X∪Y contains special items) then {
        if (support of X∪Y ≥ special item’s passage threshold) then {
          if (support of X∪Y ≥ special item’s printing threshold) then {
            add X∪Y to L;
            add X∪Y to C;
          } } }
      else {
        len = the length of pattern X∪Y;
        if (support of X∪Y ≥ length-len passage threshold) then {
          if (support of X∪Y ≥ length-len printing threshold) then {
            add X∪Y to L;
            add X∪Y to C;
          } } }
    } }
  else for (each ai in the header of Tree) do {
    generate pattern X = ai∪Y with support = ai.support;
    construct X’s conditional pattern base and then X’s conditional FP-tree Treep;
    if (Treep ≠ ∅) then
      call extended-FP-growth(Treep, X);
  }
}
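The single-path branch of the procedure can be sketched as follows for patterns without special items. The threshold tables are hypothetical placeholders, and the function name `single_path_patterns` is an assumption. Following the pseudocode as written, a pattern is kept only when its support reaches both the passage and the printing threshold for its length; the sketch returns these patterns in one dictionary instead of maintaining the sets L and C.

```python
from itertools import combinations

# Hypothetical length-dependent thresholds (absolute support counts),
# indexed by pattern length. Values are placeholders for illustration.
PASSAGE = {2: 2, 3: 2, 4: 1}
PRINTING = {2: 3, 3: 2, 4: 2}


def single_path_patterns(path, suffix):
    """Enumerate X ∪ Y for every non-empty combination X of `path`.

    path   : list of (name, count) pairs from root to leaf
    suffix : tuple of names (the pattern Y grown so far)
    Returns the patterns that pass both thresholds for their length.
    """
    reported = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            names = tuple(n for n, _ in combo) + suffix
            support = min(c for _, c in combo)
            length = len(names)
            if support >= PASSAGE[length] and support >= PRINTING[length]:
                reported[names] = support
    return reported


# d's conditional FP-tree from the example: the single path f:2, c:2, a:2.
result = single_path_patterns([("f", 2), ("c", 2), ("a", 2)], ("d",))
for names, support in result.items():
    print(names, support)
```

With these placeholder values the length-2 patterns such as (f, d) reach the passage threshold but not the printing threshold, so only the three length-3 patterns and the length-4 pattern are kept.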
4.6 DATASET DESCRIPTIONS
Here, the IBM synthetic data generator is used in this work
(http://www.almaden.ibm.com/software/quest/Resources/index.shtml). This
generator can generate datasets with various kinds of data distributions. A
generated dataset is named in a form such as “T10I4D100K”. Here,
“T” means the average number of items per transaction; “I” means the average length
of maximal patterns; and “D” means the total number of transactions in the
dataset. By default, the number of items in the dataset is one tenth of the
number of transactions. Therefore, “T10I4D100K” refers to a dataset with the
following characteristics: the average number of items per transaction is 10;
the average length of maximal patterns is 4; the total number of transactions is
100,000; and the total number of items is 10,000. A further dataset can be
downloaded from the UC-Irvine Machine Learning Database Repository. Its
total number of transactions is 67,557, and each transaction has 43
items. It is a dense dataset with many long frequent itemsets.
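The naming convention can be decoded mechanically. The following sketch is only an illustration: the function name `parse_dataset_name` is hypothetical, a trailing `K` is read as a factor of 1,000, and the item count follows the one-tenth default stated above.

```python
import re


def parse_dataset_name(name):
    """Decode a name such as 'T10I4D100K' into its parameters."""
    match = re.fullmatch(r"T(\d+)I(\d+)D(\d+)(K?)", name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name}")
    t, i, d, kilo = match.groups()
    transactions = int(d) * (1000 if kilo else 1)
    return {
        "avg_items_per_transaction": int(t),
        "avg_maximal_pattern_length": int(i),
        "transactions": transactions,
        "items": transactions // 10,  # default: one tenth of transactions
    }


print(parse_dataset_name("T10I4D100K"))
```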
4.7 EXPERIMENTAL EVALUATION AND RESULTS
Here, the proposed algorithm and the FP-growth algorithm are tested on two
datasets (T10I4D100K and T25I20D100K) under different support
thresholds.
Table 4.4 Mining Complete Set of Frequent Patterns on T10I4D100K
Runtime (in secs) at Different Supports (%)
Algorithm            0.05   0.1    0.5    1      5      10     15
FP-Growth            463    446    154    36     57     8      1
Proposed Algorithm   149    66     20     8      3      1      1
Figure 4.4 Mining Complete Set of Frequent Patterns on T10I4D100K (x-axis: Support Count (in %); y-axis: Runtime (in secs); series: FP-Growth, Proposed Algorithm)
Table 4.5 Mining Complete Set of Frequent Patterns on T25I20D100K
Runtime (in secs) at Different Supports (%)
Algorithm            20     25     30     35     40     45     50
FP-Growth            6430   3450   1455   343    257    98     10
Proposed Algorithm   1940   756    459    184    38     21     5
Figure 4.5 Mining Complete Set of Frequent Patterns on T25I20D100K (x-axis: Support Count (in %); y-axis: Runtime (in secs); series: FP-Growth, Proposed Algorithm)
Tables 4.4 and 4.5, together with Figures 4.4 and 4.5, show how the
runtimes of the proposed algorithm and the FP-growth
algorithm scale with the support threshold on two different datasets. From the graphs, one can conclude that,
of the two algorithms, the proposed method takes the least time to generate the
complete set of frequent patterns. The high efficiency of the proposed algorithm
is especially apparent on dense datasets. To conclude, in these experiments the proposed algorithm is
the more efficient of the two for mining the complete set of frequent patterns.
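The gap can be quantified directly from the tabulated runtimes. The short computation below divides the FP-growth runtime by the runtime of the proposed algorithm for each support threshold in Tables 4.4 and 4.5. The numbers are copied from the tables, and the variable and function names are illustrative.

```python
# Runtimes in seconds, copied from Tables 4.4 and 4.5.
runs = {
    "T10I4D100K": {
        "supports": [0.05, 0.1, 0.5, 1, 5, 10, 15],
        "fp_growth": [463, 446, 154, 36, 57, 8, 1],
        "proposed": [149, 66, 20, 8, 3, 1, 1],
    },
    "T25I20D100K": {
        "supports": [20, 25, 30, 35, 40, 45, 50],
        "fp_growth": [6430, 3450, 1455, 343, 257, 98, 10],
        "proposed": [1940, 756, 459, 184, 38, 21, 5],
    },
}


def speedups(run):
    """Ratio of FP-growth runtime to proposed runtime per threshold."""
    return [fp / prop for fp, prop in zip(run["fp_growth"], run["proposed"])]


for name, run in runs.items():
    ratios = ", ".join(
        f"{s}%: {r:.1f}x" for s, r in zip(run["supports"], speedups(run))
    )
    print(name, "->", ratios)
```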
A uniform support constraint is not always suitable for the task. The
proposed algorithm addresses this with flexible support constraints. Different
concept levels should have different support thresholds. Exceptional items
(either common, such as milk, or rare, such as diamond) should be picked out
and appraised against different thresholds, and patterns of different lengths
should be associated with different thresholds as well. Figure 4.6 shows how
flexible support constraints contribute to increasing the effectiveness of
multi-dimensional frequent pattern mining. It also reports the difference
between the numbers of multidimensional frequent patterns generated when
applying uniform support constraints and when pushing various
support constraints. The experiments are performed on the T25I10M3D10K dataset.
Figure 4.6 Effectiveness of Flexible Support Constraints (x-axis: Support Threshold (%); y-axis: Number of patterns (K); series: number of patterns using the uniform low level support threshold, using various support thresholds, and using the uniform high level support threshold)
All the experiments are performed on a 2.20 GHz Intel® Core 2 Duo
laptop with 3.00 GB RAM, running Microsoft Windows. All programs
are written in Microsoft Visual C++ 6.0. The proposed method can be used to
facilitate decision making and boost business sales. General frequent pattern
mining algorithms focus on mining at a single level. Moreover, they discover
only strong associations between items.