[IEEE 2012 3rd International Conference on Innovations in Bio-Inspired Computing and Applications...

An enhanced FUFP-tree maintenance approach for transaction deletion

Chanh-Truc Tran1, Bay Vo2, Tzung-Pei Hong3, Chun-Wei Lin3, Bac Le1 1University of Science, Ho Chi Minh City, Vietnam

2Information Technology College, Ho Chi Minh City, Vietnam 3Department CSIE, National University of Kaohsiung, Taiwan, R.O.C

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract – The fast updated frequent pattern

tree (FUFP-tree) is an efficient data structure for association-rule mining. Hong et al. (2009) proposed an approach for the maintenance of the FUFP-tree structure after the deletion of transactions. However, all transactions in the original database might need to be rescanned to determine the occurrence of infrequent items, which were not stored during the mining and maintenance process. The rescanning process steps can be time-consuming depending on the original database size and number of rescanned items. The study in this paper enhances Hong et. al.’s approach. In the proposed algorithm, the infrequent 1-itemsets are stored during the maintenance process and the rescanned items are pruned out step by step to reduce execution time. Experimental results verify the performance of the proposed algorithm.

Keyworks – frequent itemset, infrequent itemset,

Header-Table, Flist, IFlist, FList_Del, IFList_Del, corresponding branch.

I. Introduction

Many efficient techniques and algorithms for data

mining have been developed [1]. One of the most common issues in data mining is finding association rules in transaction databases [1, 2, 3, 4, 5, 10, 11]. In most association-rule mining algorithms, the rules are mined as fast as possible.

Many algorithms for mining association rules from transaction databases are based on the Apiori algorithm (Agrawal et al., 1993a), which repeatedly scans the database to generate and process candidate itemsets level by level and thus has a high computational cost.

Han et al. (2000) [6] proposed the frequent-pattern-tree (FP-tree) structure for efficiently mining association rules without the generation of candidate itemsets. In real-world applications, the transaction

databases are continuously updated, with transaction insertion and deletion being common [7, 8, 9]. Efficient maintenance algorithms are thus required for these operations. The efficient incremental fast-updated-frequent-pattern-tree (FUFP-tree) maintenance algorithm for handling transaction insertion was proposed by Hong et al. (2008) [8], who later proposed an FUFP-tree maintenance approach for transaction deletion (Hong et al., 2009) [7].

The latter algorithm is used to maintain the FUFP-tree structure after transaction deletion without reconstructing the FUFP-tree from the beginning. However, the original database needs to be rescanned to determine the occurrence of infrequent items, which are not stored during incremental mining.

This present study enhances the FUFP-tree maintenance algorithm for deleted transactions. Experimental results show that the proposed algorithm has good performance in incremental handling deleted transactions.

II. Review of related works In this section, research related to the proposed

algorithm is briefly reviewed.

A. Frequent pattern tree Han et al. [6] proposed the FP-tree structure for

efficiently mining association rules without the generation of candidate itemsets. FP-tree is used to represent a database in a tree structure, in which only frequent items are kept. The root node of the tree is set to Null. A Header-Table consists of the frequent items sorted in descending order of their frequency as well as pointers which link to their first occurrence nodes in the FP-tree. Nodes that have the same item name are linked in sequence.

2012 Third International Conference on Innovations in Bio-Inspired Computing and Applications

978-0-7695-4837-1/12 $26.00 © 2012 IEEE

DOI 10.1109/IBICA.2012.30

45

B. FUFP-Tree and transaction deletion maintenance approach.

The FUFP-tree (Hong et al. 2009) [7] is constructed

like the FP-tree except that it has bi-directional links between parent nodes and their child nodes. The bi-directional links speed up tree traversal when items are processed in the maintenance process.

When transactions are deleted from the database, the proposed algorithm processes them to maintain the FUFP-tree without reconstructing it from the updated database. Depending on whether items are large in the original database and in the deleted transactions, there are four cases to consider. Each case is processed separately. Header-Table and the FUFP-tree are appropriately updated if necessary.

Table 1. The four cases of transaction deletion

Original DB Deleted Transactions

Case 1 Frequent Frequent Case 2 Frequent Infrequent Case 3 Infrequent Frequent Case 4 Infrequent Infrequent

In the original maintenance process for the FUFP-

tree after transaction deletion, item deletion is conducted before item insertion. The algorithm includes four main parts:

To process case 1 and case 2 are “Part 1: Update the frequency of items in Header-Table” and “Part 2: Update the FUFP-tree according to the set of Reduced-Items”. To process case 4 are “Part 3: Rescan the database to generate the set of Rescan-Items” and “Part 4: Update the FUFP-tree and Header-Table according to the set of Rescan-Items”.

In this algorithm, we have some points need to be improved: In part 2 and part 4, because the items in the corresponding branch are different from each other, once one of the items in Reduced-Items / Rescan-Items appears in one node of the branch, this item and this node should not be considered in subsequent runs. In part 3, because the infrequent items are not stored during the incremental mining process, the original database needs to be rescanned to determine their frequencies. This step is thus the most time-consuming step. The computation time of this step linearly increases with the number of transactions in the original database, the number of items in each transaction (the length of each transaction) and the number of items in the set of Rescan-Items.

III. Enhanced algorithm

A. Notations

D the original database T the set of deleted transactions U the entire updated database d the number of transactions in D t the number of transactions in T I a specific item CountOrg(I) the frequency of I in D CountDel(I) the frequency of I in T CountUpdated(I) the frequency of I in U Htable the current Header-Table of current FUFP-tree FUFP_tree the current FUFP-tree Count(I) a temp variable to store the count of I Reduced_Items list of Reduced-Items to update the FUFP-tree

based on deleted transactions Flist the list of large items of D IFlist the list of small items of D Flist_Del the list of large items of T IFlist_Del the list of small items of T Sup the threshold for frequent itemsets minSup_Org the minimum support of D minSup_Del the minimum support of T minSup the minimum support of U Item_Case3 a temp list to store Items of Case 3 Item_Case3 a temp list to store Items of Case 4 Items a temporary list to store items Nodes a temporary list to store nodes Rescan_Items list of Rescan-Items to update the FUFP-tree based

on the updated database.

B. Proposed algorithm The enhancements described in section II.B are

given in pseudo-code below. INPUT: An original database (D), its

corresponding Header-Table (Htable) storing the frequent items in descending order, its corresponding FUFP-tree (FUFP_tree), a support threshold (Sup), and a set of t deleted transactions (T).

OUTPUT: An updated FUFP-tree for the updated database (U).

CompleteDelete (item) is a function that removes an item completely from the FUFP-Tree by removing all nodes with the same item, connecting the children of these nodes directly to their parent nodes, and finally removing the item from Header-Table. Has_Items (transaction, itemlist) is a function that determines which items of a given item list appear in a given transaction. These items are sorted in descending order of their frequency. A corresponding branch is the branch generated from the frequent items in a transaction according to the order of items appearing in Header-Table.

46

Begin Procedure 1 Scan the deleted transactions T and store large items into Flist_Del and small

items into IFlist_Del; Scan IFlist and Flist_Del to find the items of case 3 and store into Items_Case3. Scan IFlist to find items which do not appear in Flist_Del, and store into Items_Case4.

2 3 4 5 6 7 8 9

10 11

FOR each item I in Htable DO { CountUpdated(I) = CountOrg(I) – CountDel(I);

Flist.Count(I) = CountUpdated(I); IF (CountDel(I) � minSup_Del) THEN // Case 1 & 2 { IF (CountUpdated(I) < minSup) THEN

{ Move I from Flist to IFlist; CompleteDelete (I); Continue with next I;}}

CountOrg(I) = CountUpdated(I); Add I to the set of Reduced_Items;}

Part 1

12 13 14 15 16 17 18 19 20 21

FOR each transaction J in T DO { Items = Has_Items (J, Reduced_Items);

IF (| Items | == 0) THEN // No item of Reduced Items appears in J Continue with next J;

Find the corresponding branch B of J in FUFP_tree; FOR each node N in B DO { IF (N == an item I in Items) THEN

{ Decrease the count of N in the branch by 1; IF (Count (N) == 0) THEN remove N from FUFP_tree;

Remove I from Items;}}}}

Part 2

22 23 24 25 26 27 28 29 30 31 32 33 34

FOR each item I in Items_Case4 DO { // with CountOrg(I) from IFlist, CountDel(I) from IFlist_Del

CountUpdated(I) = CountOrg(I) – CountDel(I); IF (CountUpdated(I) == 0) THEN Remove I from IFlist ELSE { IFlist.Count(I) = CountUpdated(I);

IF (CountUpdated(I) � minSup) THEN { CountOrg(I) = CountUpdated(I);

Add I to Rescan_Items;} Sort the items in Rescan_Items by their CountUpdated(I) in descending order; FOR each item I in Rescan_Items DO { Insert I with its CountUpdated(I) at the end of Htable; Move I from IFlist to Flist;}

Part 3

35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54

FOR each transaction J in U DO { Items = Has_Items (J, Rescan_Items); IF (| Items | == 0) THEN // No item of Rescan Items appears in J Continue with next J;

Find the corresponding branch B of J in FUFP_tree; Branch = B; // store B to the temporary branch named Branch. FOR each item I in Items DO { IF (I == a node N in Branch) THEN { Increase the count of N in the branch by 1; Remove N from Branch;} ELSE { Insert I as a node at the end of the branch with 1 as its count;

Re-find the corresponding;}}}} FOR each item I in Items_Case3 DO { // with CountOrg(I) from IFlist, CountDel(I) from Flist_Del

CountUpdated(I) = CountOrg(I) – CountDel(I); IF (CountUpdated(I) == 0) THEN

Remove I from IFlist; ELSE

IFlist.Count(I) = CountUpdate(I);}

Part 4

End Procedure

IV. An example An example is shown below to illustrate the

proposed algorithm for maintaining an FUFP-tree after transactions are deleted. Table 2 shows the database used in the example. The original database contains 10 transactions and 12 items, from a to l.

Table 2. Original database used for the example.

No Items No Items 1 a, b, c, d, e, j, k, l 6 d, e, f, g, h 2 a, b, c, d, e, j, k ,l 7 a, b, c, g, h, i 3 a, b, d, e, f 8 a, b, c, h, i 4 a, d, f, j, l 9 a, b, c, h, i 5 d, e, f, g, j, k, l 10 d, e, f, g, j, k

The support threshold is set at 50%. For the given database, minSup_Org is 5, and the frequent 1-itemsets are a, d, b, e, c, and f, from which Header-Table can be constructed. The FUFP-tree is then built from the original database and Header-Table. The results are shown in Figure 1.

Figure 1. FUFP-tree, Header-Table for example Assume that the last four transactions (7 to 10) are

deleted from the original database. The proposed algorithm proceeds as follow:

In the initial steps, the four deleted transactions are first scanned to get the items and their counts. Frequent items are stored in Flist_Del and infrequent items are stored in IFlist_Del based on minSup_Del = 4 × 50% = 2. The frequent items and infrequent items of the original database are stored in Flist and IFlist, respectively, during the FUFP-tree construction. The items of the four cases when deleted transactions are: Case 1: {a, b, c}, Case 2: {d, e, f}, Case 3: {g, h, i} stored in Items_Case3, Case 4: {j, k, l} stored in Items_Case4.

In part 1, line 2, the frequent items which appear in Header-Table are processed. The minSup for an item to be frequent in the updated database is (10 – 4) × 50% = 3. The items in Header-Table are {a:7, d:7, b:6, e:6, c:5, f:5}. Their frequencies in deleted transaction are {a:3, d:1, b:3, e:1, c:3, f:1}. On line 3, the new count of item a in the updated database CountUpdated(a) is 7 – 3 = 4, which is greater than the minimum count. Line 4 updates the frequency of a in Flist to 4. a is thus still frequent for the updated database so lines 6 to 9 are not used. Lines 10 and 11 are activated to update the new count of a in Header-Table to 4 and to place a in the set of Reduced-Items. For item d, which belongs to case 2, CountUpdated(d) is 7 – 1 = 6. The frequency of d in Flist and in Header-Table is updated to 6 on line 4 and line 10. Lines 5 to 9 are not used because the IF statement on line 5 is not satisfied. d is added to Reduced-Items on line 11. The next two items b and e are processed similarly to a and d. For c, CountUpdated(c) = 5 – 3 = 2, which is less than the minSup of the updated database and thus c is no longer frequent. The IF statements in lines 5 and 6 are satisfied, so c is moved from Flist to IFlist and

47

completely removed from the FUFP-tree and Header-Table. Lines 10 and line 11 are not used. Item f, which belongs to case 2, is processed in the same way as item d. Header-Table and FUFP-tree obtained after part 1 are shown in Figure 2, Flist and IFlist are updated, and Reduced-Items = {a, d, b, e, f}.

Figure 2. FUFP-tree, Header-Table, Flist and IFlist after

part 1 has been processed. In part 2, lines 12 to 21, updates the FUFP-tree

according to the deleted transactions with items in Reduced-Items. The deleted transactions and the items appearing in Reduced-Items are shown in Table 3.

Table 3. Deleted transactions and items appear in

Reduced-Items No Deleted trans. Items Cor. branch 7 {a, b, c, g, h, i} a, b a � b 8 {a, b, c, h, i} a, b a � b 9 {a, b, c, h, i} a, b a � b

10 {d, e, f, g, j, k} d, e, f d � e � f Each transaction in the deleted transactions is

processed. For transaction 7, line 13 uses the function Has_Items to get the items of the transaction which appear in Reduced-Items and place them into a temporary item list called Items. The IF statement on line 14 is false because Items = {a, b}. The corresponding branch of this transaction in the FUFP-tree is B = {a � b}. Lines 17 to 21 decrease the frequency of the corresponding nodes in the branch. Each node N in branch B is processed. First, N = {a:7}. N appears in Items. Line 19 decreases the count of N by 1, and thus the new frequency of a in the branch is 7 – 1 = 6. {a} is removed from Items. Because in a branch of the FUFP-tree, each item appears only once, it is unnecessary to consider the item in subsequent runs. Therefore item {a} is pruned from Items after being processed, which speeds up the FUFP-tree updating in part 2 of the algorithm. Items = {b}. The next node, N = {b:3}, is treated the same way. The count of b in the branch is reduced to 2 and {b} in Items is deleted. Like transaction 7, transactions 8 and 9 have a and b appearing Reduced-Items and their corresponding branches are also {a � b}, so they are handled in the same way. After transaction 9 is processed, the frequency of b in the branch is 0. Thus,

it is removed from FUFP-tree on line 20. For transaction 10, function Has_Items on line 13 returns {d, e, f}. The corresponding branch of this transaction in the FUFP-tree is B = {d � e � f}. Lines 17 to 21 decrease the count of nodes d, e and f in the branch from 3, 3, and 3 to 2, 2, and 2, respectively. Header-Table and the FUFP-tree obtained after part 2 are shown in Figure 3. Flist and IFList are unchanged.


part 2 has been processed Part 3, generates the Rescan-Items for case 4. The

items that satisfy case 4 are stored in the Items_Case4 temporary list in the initial steps. Items_Case4 = {j, k, l}. Each item will be processed. On line 24, the count in the updated database of items in Items_Case4 are calculated from the original count, which is taken from IFlist, and the count in the deleted transactions, which is taken from IFlist_Del. For j, CountUpdated(j) = 4 – 1 = 3, which is greater than 0 and equal to minSup of the updated database so j becomes frequent after some transactions are deleted and needs to be rescanned. The frequencies of j in IFlist and in Items_Case4 are updated to 3 on lines 27 and 29, and j is added to Rescan-Items = {j}. With CountUpdated(k) = 3 – 1 = 2, k is still infrequent in the updated database so it is skipped. For item l, CountOrg(l) in IFlist is 3, but l does not appear in any deleted transactions, so CountDel(l) = 0, CountUpdated(l) = 3 – 0 = 3. l becomes frequent in the updated database, and thus the new frequency of l is updated in IFlist and Items_Case4, and l is inserted into Rescan-Items. After line 30, Rescan-Items = {j:3, l:3}. On line 31, the items in Rescan-Items are sorted in descending order in term of their frequency. Here, no sorting is needed. After lines 32 to 34 are executed, {j:3} and {l:3} are added to the end of Header-Table as shown in Figure 4 and {j, l} are moved to Flist.


part 3 has been processed

48

Part 4 updates the FUFP-Tree according to the transactions in the updated database and the Rescan-Items list. Table 4 shows the corresponding branches in the updated database with items in Rescan-Items.

Table 4. Corresponding branches of the updated

database with Rescan-Items

No Updated DB Items Cor. branch 1 {a, b, c, d, e, j, k, l} {j, l} {a � d � b � e} 2 {a, b, c, d, e, j, k, l} {j, l} {a � d � b � e � j � l} 3 {a, b, d, e, f} - - 4 {a, d, e, f, j, l} {j, l} {a � d � f} 5 {d, e, f, g} - - 6 {d, e, f ,g ,h} - -

Transaction 1 is used to illustrate the steps from

lines 35 to 47. The items of transaction 1 appearing in Rescan-Items are stored in a temporary list called Items = {j, l}. The corresponding branch of transaction 1 in the FUFP-Tree is B = {a � d � b � e}. The corresponding branch is stored in a temporary variable Branch. For j, there is no matched item in the branch. Line 46 is executed. {j:1} is inserted at the end of the branch B. The pointer in Header-Table of item j is updated pointing to {j:1}. The branch is recalculated on line 47. The corresponding branch of transaction 1 is updated. Branch = {a � d � b � e � j}. The next item in Items is l, which does not exist in the branch. Therefore, {l:1} is added as a new child node of node j at the end of the branch. Like transaction 1, transaction 2 has {j, l} in Rescan-Items; its corresponding branch in FUFP-tree is {a � d � e � f � j � l}. j appears in the branch, so its frequency in the branch is increased from 1 to 2, and it is removed from Branch because each item appears only once in one branch of the FUFP-tree, so there is no need to rescan a node if it is already processed. l also exists in the branch thus its count in the branch is updated from 1 to 2, and it is deleted from the branch to prevent it from being rescanning to speed up the algorithm. Transaction 3 is discarded because it has no items that need to be rescanned. Transaction 4 has {j, l} in Items. Its corresponding branch is {d � e � f}. There are no matched items in Branch. {j:1} and {l:1} are added at the end of this corresponding branch. The last two transactions (5 and 6) are also ignored because no items need to be rescanned.

Lines 48 to 54 update the frequency in IFlist of the items in case 3. Items_Case3 = {g, h, i}. For g, CountUpdated(g) = 4 – 2 = 2. The frequency of g in IFlist is updated to 2. The same procedure is applied to h. CountUpdated(h) = 4 – 3 = 1, and the count of h in IFlist is set to 1. For i, CountOrg(i) = 3, and CountDel(i) = 3, so CountUpdated(i) = 3 – 3 = 0. The IF statement on line 51

is true and i is removed from IFlist on line 52. Figure 5 shows the final results.

Figure 5. FUFP-tree, Header-Table, Flist and IFlist after part 4 has been processed

V. Experimental results Experiments were programmed in C# on a laptop

with an Intel 1.73 GHz quad-core CPU and 8 GB of RAM, running MS Windows 7 Ultimate 64 bits. Two real datasets called BMS-POS and MUSHROOM were used in the experiments. The BMS-POS contained several years of point-of-sale data from a large electronics retailer. There are 515,597 transactions with 1,657 items in the dataset. The maximal length of a transaction is 164 and the average length of the transactions was 6.5. There are 8,124 transactions with 22 items in the MUSHROOM.

The parameters in this experiment were set to those used by Hong et al. (2009). The first 500,000 transactions were extracted from the BMS-POS database to construct an initial FUFP-tree. For the MUSROOM, the first 8,000 transactions are used. For each run, the last 100 transactions from the last updated database were used as deletion transactions. The minSup was set to 3%, 5%, and 7%, respectively. The execution times obtained from the two algorithms with three different minimum support thresholds are shown in Table 5. Each value is the average execution time of the two algorithms over 5 runs.

The execution time of the proposed approach is much lower than that of the original approach. The main reasons are that the original algorithm counts the original frequencies of infrequent items in the original database whereas the proposed algorithm takes them directly from IFlist, which is created during FUFP-tree constructions. In addition, the items of Reduced-Items and Rescan-Items are stored in a temporary list, and items are pruned step by step after being processed. The proposed algorithm thus makes fewer comparisons when traversing the temporary list.

49

Table 5. Execution time of the two algorithms with different thresholds.

DB % Algorithms Run time (s) of each 100

transactions deleted -100 -200 -300 -400 -500

BM

S-PO

S 3 Hong et al. 12.703 9.184 9.355 9.189 9.145Proposed Algorithm 0.104 0.055 0.054 0.052 0.059

5 Hong et al. 10.861 9.157 9.270 9.173 9.176Proposed Algorithm 0.128 0.054 0.055 0.056 0.055


MU

SHR

OO

M


5 Prop. Alg. 0.353 0.301 0.292 0.253 0.135Proposed Algorithm 0.028 0.019 0.020 0.065 0.017


VI. Conclusions and future work An enhanced FUFP-tree maintenance approach for

transaction deletion was proposed. The algorithm speeds up the calculation of the frequency of infrequent items in the original database by not rescanning the database and the list Rescan-Items. The number of comparisons is reduced when updating the FUFP-tree by traversing the corresponding branches and finding the matched items appearing in Reduced-Items and/or Rescan-Items because the number of items in the set of Reduced-Items and Rescan-Items are stored in a temporary list and pruned step by step. The experimental results for the BMS-POS and the MUSHROOM datasets with different minimum thresholds indicate that the proposed approach decreases the execution time compared to the original algorithm. The numbers of nodes of the FUFP-tree constructed by the two algorithms are the same. The proposed approach requires more memory to keep the list of frequent items and infrequent items of the original database. The proposed algorithm is more efficient for large databases. For small databases with a few thousand of records, such as MUSHROOM, the difference is not very clear. In the future, based on properties and the structure of FUFP-tree, we can improve the performance of the algorithm for handling transactions insertion.

References

[1] R. Agrawal, T. Imielinski, A. Swami. “Database Mining: A Performance Perspective”. IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914-925 (1993).

[2] R. Agrawal, R. Srikant. “Fast Algorithms for Mining Association Rules in Large Databases”. In the 20th International Conference on Very Large Databases, pp. 487-499 (1994).

[3] R. Agrawal, R. Srikant, Q. Vu. “Mining association rules with item constraints”. In The third international conference on knowledge discovery in databases and data mining, pp. 67–73 (1997).

[4] T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama. “Mining optimized association rules for numeric attributes”. In The ACM Sigact-Sigmod symposium on principles of database systems, pp. 182–191 (1996).

[5] J. Han, Y. Fu. “Discovery of multiple-level association rules from large database”. In The Twenty-�rst international conference on very large data bases, pp. 420–431 (1995).

[6] J. Han, J. Pei, Y. Yin. “Mining Frequent Patterns without Candidate Generation”. SIGMOD Conference, pp. 1-12 (2000).

[7] T.P. Hong, C.W. Lin, Y.L. Wu. “Maintenance of fast updated frequent pattern trees for record deletion”. Computational Statistics & Data Analysis, 53(7), pp. 2485-2499 (2009).

[8] T.P. Hong, C.W. Lin, Y.L. Wu. “Incrementally fast updated frequent pattern trees”. Expert Systems with Applications: An International Journal, 34(4), pp. 2424-2435 (2008).

[9] C.W. Lin, T.P. Hong, Y.L. Wu Lin. “The Pre-FUFP algorithm for incremental mining”. Expert Systems with Applications: An International Journal, 36(5), pp. 9498-9505 (2009).

[10] H. Mannila, H. Toivonen, A.I. Verkamo. “Ef�cient algorithm for discovering association rules”. In The AAAI workshop on knowledge discovery in databases, pp. 181–192 (1994).

[11] J.S. Park, M.S. Chen, P.S. Yu. “Using a hash-based method with transaction trimming for mining association rules”. IEEE Transactions on Knowledge and Data Engineering, 9(5), pp. 812–825 (1997).

50

[IEEE 2012 3rd International Conference on Innovations in Bio-Inspired Computing and Applications...

Documents

Transcript of [IEEE 2012 3rd International Conference on Innovations in Bio-Inspired Computing and Applications...