1
Efficient Algorithms for Mining Share-Frequent Itemsets
Authors: Y. C. Li, J. S. Yeh and C. C. Chang
Speaker: Yu-Chiang Li
Date: July 28, 2005
2
Outline
Introduction
Related Work
Enhanced Fast Share Measure (EFSM) Algorithm
Support-Counted Fast Share Measure (SuFSM) Algorithm
Share-Counted Fast Share Measure (ShFSM) Algorithm
Experimental Results
Conclusions
3
Introduction (1/2)
Goal: discovering the buying patterns of customers
Itemset: a group of items (products) bought together in a transaction
Support: the ratio of the number of transactions containing the itemset to the total number of transactions (gives limited information, because purchase quantities are ignored)
Share: the ratio of the total count of the itemset's items, taken over the transactions that contain the itemset, to the total count of all items in the database
4
Introduction (2/2)
Share-confidence framework: provides useful information about the numerical values associated with transaction items (Carter et al., 1997)
Share-frequent (SH-frequent) itemset: usually includes some infrequent subsets
The Fast Share Measure (FSM) algorithm discovers share-frequent itemsets efficiently on small datasets
This study proposes Enhanced FSM (EFSM), SuFSM and ShFSM, which discover share-frequent itemsets more efficiently than FSM
5
Related Work
Support-Confidence Framework (Agrawal et al., 1993)
  Each item is a binary variable denoting whether the item was purchased
  Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
  Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
Share-Confidence Framework (Carter et al., 1997)
  The support-confidence framework does not analyze the exact number of products purchased
  The support count method does not measure the profit or cost of an itemset
  Exhaustive search algorithm (Carter et al., 2000)
  FSM algorithm (Li et al., 2005)
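To make the limitation of support counting concrete, here is a minimal Python sketch on a small made-up database. The transactions, item names and quantities are illustrative assumptions, not the example used in the talk. Support only checks whether an itemset is present, so the purchased quantities play no role in the result.

```python
# Hypothetical toy database: transaction id -> {item: purchased quantity}.
TOY_DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"A": 2, "C": 3},
    "T3": {"B": 1, "C": 2, "D": 1},
    "T4": {"A": 1, "D": 4},
}


def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in db.values() if all(i in t for i in itemset))
    return hits / len(db)


sup_ac = support({"A", "C"}, TOY_DB)  # contained in T1 and T2
sup_ad = support({"A", "D"}, TOY_DB)  # contained in T4 only
print(sup_ac, sup_ad)
```

In this toy data {A, D} occurs in a single transaction but accounts for five purchased units there, a fact that the support value cannot express.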
6
Related Work
Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
7
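Share-Confidence Framework
Measure value: $mv(i_p, T_q)$, the measure value of item $i_p$ in transaction $T_q$
  $mv(\{D\}, T01) = 1$, $mv(\{C\}, T03) = 3$
Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$
  $tmv(T02) = 9$
Total measure value: $Tmv(DB) = \sum_{T_q \in DB} \sum_{i_p \in T_q} mv(i_p, T_q)$
  $Tmv(DB) = 44$
Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X,\, X \subseteq T_q} mv(i_p, T_q)$
  $imv(\{A, E\}, T02) = 4$
Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions that contain X
  $lmv(\{BC\}) = 2 + 4 + 5 = 11$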
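The measure values defined on this slide can be sketched in Python as follows. The toy database is a made-up assumption (it is not the example database of the talk, whose transaction table is not part of this transcript), and the function names simply mirror the slide's notation.

```python
# Hypothetical toy database: transaction id -> {item: purchased quantity}.
TOY_DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"A": 2, "C": 3},
    "T3": {"B": 1, "C": 2, "D": 1},
    "T4": {"A": 1, "D": 4},
}


def mv(item, transaction):
    """Measure value of one item in one transaction (0 if absent)."""
    return transaction.get(item, 0)


def tmv(transaction):
    """Transaction measure value: sum over all items of the transaction."""
    return sum(transaction.values())


def total_mv(db):
    """Total measure value Tmv(DB): sum of tmv over all transactions."""
    return sum(tmv(t) for t in db.values())


def imv(itemset, transaction):
    """Itemset measure value: counted only if the whole itemset is present."""
    if not all(i in transaction for i in itemset):
        return 0
    return sum(transaction[i] for i in itemset)


def lmv(itemset, db):
    """Local measure value: imv summed over the transactions containing X."""
    return sum(imv(itemset, t) for t in db.values())


print(tmv(TOY_DB["T2"]), total_mv(TOY_DB), lmv({"A", "C"}, TOY_DB))
```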
8
Itemset share: $SH(X) = lmv(X) / Tmv$
  $SH(\{BC\}) = 11/44 = 25\%$
minShare = 30%
SH-frequent: if $SH(X) \ge minShare$, X is a share-frequent (SH-frequent) itemset; since 25% < 30%, {BC} is not SH-frequent
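A short Python sketch of the share test, again on a made-up toy database (the data and the 35% threshold are assumptions chosen for illustration). It also shows why share-frequent itemsets lack the downward-closure property: {A, C} passes the threshold although neither {A} nor {C} does.

```python
# Hypothetical toy database: transaction id -> {item: purchased quantity}.
TOY_DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"A": 2, "C": 3},
    "T3": {"B": 1, "C": 2, "D": 1},
    "T4": {"A": 1, "D": 4},
}
MIN_SHARE = 0.35  # illustrative threshold, not taken from the slides


def share(itemset, db):
    """SH(X) = lmv(X) / Tmv(DB)."""
    total = sum(sum(t.values()) for t in db.values())
    local = sum(
        sum(t[i] for i in itemset)
        for t in db.values()
        if all(i in t for i in itemset)
    )
    return local / total


for name in ("A", "C", "AC"):
    sh = share(set(name), TOY_DB)
    print(name, round(sh, 3), "SH-frequent" if sh >= MIN_SHARE else "infrequent")
```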
9
Existing algorithms
ZP (Zero Pruning), ZSP (Zero Subset Pruning)
  Variants of exhaustive search
  Prune the candidate itemsets whose local measure values are exactly zero
FSM (Fast Share Measure) (Li et al., 2005)
  Fast on small datasets
  Generates too many candidates
Existing algorithms are inefficient on large datasets
10
ZP Algorithm
Candidate itemsets with their lmv values:
A:10 B:8 C:10 D:6 E:4 H:1...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CE:10 CD:8 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 BDE:0 BCE:0 CDE:0
ABCD:4 ABCE:0 ABDE:0 ACDE:0 BCDE:0
ABCDE:0
11
ZSP Algorithm
Candidate itemsets with their lmv values:
A:10 B:8 C:10 D:6 E:4 H:1...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CE:10 CD:8 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0
ABCD:4 ACDE:0
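The zero-pruning idea can be sketched as a level-wise exhaustive search in Python. This is only an illustration under assumptions: the toy database is made up, and the way candidates are extended here (every surviving itemset is extended by one further item) is my own simplification rather than the published ZP or ZSP procedure. The only rule taken from the slides is that a candidate whose lmv is exactly zero is dropped and never extended.

```python
# Hypothetical toy database: transaction id -> {item: purchased quantity}.
TOY_DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"A": 2, "C": 3},
    "T3": {"B": 1, "C": 2, "D": 1},
    "T4": {"A": 1, "D": 4},
}


def lmv(itemset, db):
    """Local measure value of `itemset` over the transactions containing it."""
    return sum(
        sum(t[i] for i in itemset)
        for t in db.values()
        if all(i in t for i in itemset)
    )


def zero_pruning_search(db, min_share):
    """Level-wise search that only discards candidates with lmv == 0."""
    items = sorted({i for t in db.values() for i in t})
    total = sum(sum(t.values()) for t in db.values())
    level = {frozenset([i]) for i in items}
    frequent = {}
    while level:
        survivors = []
        for x in level:
            value = lmv(x, db)
            if value == 0:
                continue  # zero pruning: never extend this candidate
            survivors.append(x)
            if value / total >= min_share:
                frequent[x] = value
        # Assumed extension rule: add one new item to each survivor.
        level = {x | {i} for x in survivors for i in items if i not in x}
    return frequent


result = zero_pruning_search(TOY_DB, min_share=0.35)
print(result)
```

Because only zero-valued candidates are removed, almost the whole lattice is still visited, which is in line with the slide's remark that these methods are variants of exhaustive search.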
12
FSM: Fast Share Measure Algorithm
ML: maximum transaction length in DB
MV: maximum measure value in DB
Let min_lmv = minShare × Tmv
Let CF_FSM(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k), where k is the length of X
If CF_FSM(X) < min_lmv, all supersets of X are infrequent
13
FSM: Fast Share Measure Algorithm
A:10 B:8 C:10 D:6 E:4 H:1...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CE:10 CD:8 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0
minShare = 30%, ML = 6, MV = 3, Tmv = 44
min_lmv = 14 (30% × 44 = 13.2; lmv values are integers, so the effective threshold is 14)
Prune X if CF_FSM(X) < min_lmv
Let X = {A B C}: CF_FSM(X) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14 = min_lmv
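The FSM pruning test of this example can be replayed in a few lines of Python. The numbers come from the slide; the function name `cf_fsm` and the use of `math.ceil` to obtain the threshold 14 from 13.2 are assumptions made for this sketch.

```python
import math


def cf_fsm(lmv_x, k, mv_max, ml):
    """FSM critical function: lmv(X) + (lmv(X)/k) * MV * (ML - k)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)


# Values stated on the slide.
MIN_SHARE, TMV, ML, MV = 0.30, 44, 6, 3

# Assumption: the threshold is rounded up because lmv values are integers.
min_lmv = math.ceil(MIN_SHARE * TMV)

bound = cf_fsm(lmv_x=3, k=3, mv_max=MV, ml=ML)  # X = {A, B, C}
prune = bound < min_lmv
print(min_lmv, bound, prune)
```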
14
Enhanced FSM (EFSM) Algorithm
EFSM: instead of joining two arbitrary itemsets in RC_{k-1}, EFSM joins each itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently
Reduces the time complexity of the join step from O(n^{2k-2}) to O(n^k)
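A minimal Python sketch of the EFSM join step, assuming itemsets are represented as frozensets and that duplicate candidates are removed with a set (the talk does not say how duplicates are handled, so that part is an assumption). The pruning with the critical function is deliberately left out; only candidate generation is shown.

```python
def efsm_candidates(rc_prev, rc1):
    """Join every itemset of RC_{k-1} with every single item of RC_1."""
    return {x | {item} for x in rc_prev for item in rc1 if item not in x}


# Small illustration with hypothetical retained itemsets.
rc1 = ["A", "B", "C"]
rc2 = [frozenset("AB"), frozenset("BC")]
c3 = efsm_candidates(rc2, rc1)
print(sorted("".join(sorted(x)) for x in c3))

# Sanity check on sizes: 197 retained single items give 197 * 196 / 2 pairs.
items = range(197)
c2 = efsm_candidates([frozenset([i]) for i in items], items)
print(len(c2))
```

The second call yields 19306 two-item candidates, which agrees with the C2 entry reported for ShFSM (RC1 = 197) in the results table of this talk.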
15
SuFSM (Support-counted FSM)
X^{k+1}: an arbitrary superset of X with length k+1 in DB
S(X^{k+1}): the set of all X^{k+1} in DB
dbS(X^{k+1}): the set of transactions that each contain at least one X^{k+1}
SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM
SuFSM (Support-counted FSM):
  Theorem 1. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent
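The SuFSM bound of Theorem 1 can be sketched in Python as follows. The toy database is made up, and the sketch assumes my reading of the definitions: dbS(X^{k+1}) is the set of transactions that contain X plus at least one further item, and Sup(S(X^{k+1})) is the number of such transactions.

```python
# Hypothetical toy database: transaction id -> {item: purchased quantity}.
TOY_DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"A": 2, "C": 3},
    "T3": {"B": 1, "C": 2, "D": 1},
    "T4": {"A": 1, "D": 4},
}


def lmv(itemset, db):
    """Local measure value of `itemset`."""
    return sum(
        sum(t[i] for i in itemset)
        for t in db.values()
        if all(i in t for i in itemset)
    )


def db_s(itemset, db):
    """Ids of transactions holding `itemset` plus at least one more item."""
    return [
        tid
        for tid, t in db.items()
        if all(i in t for i in itemset) and len(t) > len(itemset)
    ]


def cf_sufsm(itemset, db):
    """lmv(X) + Sup(S(X^{k+1})) * MV * (ML - k)."""
    k = len(itemset)
    ml = max(len(t) for t in db.values())                     # ML
    mv_max = max(q for t in db.values() for q in t.values())  # MV
    return lmv(itemset, db) + len(db_s(itemset, db)) * mv_max * (ml - k)


bound = cf_sufsm({"A", "C"}, TOY_DB)
print(db_s({"A", "C"}, TOY_DB), bound)
```

Here T2 contains {A, C} but nothing else, so it cannot support any superset and is excluded from the count, which is what makes this bound tighter than the FSM one.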
16
SuFSM (Support-counted FSM)
lmv(X)/k >= Sup(X) >= Sup(S(X^{k+1}))
Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2
Each of the following three conditions guarantees that no superset of X is an SH-frequent itemset; each condition is implied by the one before it, so the later tests prune at least as many candidates:
  lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
  lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
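The example values 5, 3 and 2 reflect the ordering lmv(X)/k ≥ Sup(X) ≥ Sup(S(X^{k+1})). A short justification, assuming that every measure value is at least 1 (for instance a purchased quantity) and that Sup denotes a support count, could read:

```latex
\begin{align*}
lmv(X) &= \sum_{T_q \in db_X} \sum_{i_p \in X} mv(i_p, T_q)
        \;\ge\; \sum_{T_q \in db_X} k \;=\; k \cdot Sup(X)
        && \text{since each } mv(i_p, T_q) \ge 1 \text{ and } |X| = k, \\
Sup(X) &= |db_X| \;\ge\; |dbS(X^{k+1})| \;=\; Sup(S(X^{k+1}))
        && \text{since } dbS(X^{k+1}) \subseteq db_X .
\end{align*}
```

Dividing the first line by k gives lmv(X)/k ≥ Sup(X), and the second line completes the chain. Substituting the three quantities into lmv(X) + (·) × MV × (ML - k) therefore produces bounds that shrink from the first condition to the third.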
17
ShFSM (Share-counted FSM)
dbS(X^{k+1}): the set of transactions that each contain at least one X^{k+1}
ShFSM (Share-counted FSM):
  Theorem 2. If Tmv(dbS(X^{k+1})) < min_lmv, all supersets of X are infrequent
FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
ShFSM: Tmv(dbS(X^{k+1})) < min_lmv
CF_FSM(X) >= CF_SuFSM(X) >= CF_ShFSM(X)
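The three critical functions can be compared side by side in Python. This is a sketch on a made-up toy database; as an assumption, dbS(X^{k+1}) is taken to be the transactions that contain X and at least one further item. The final dictionary evaluates all itemsets of up to three items so that the ordering of the bounds can be checked on this data.

```python
from itertools import combinations

# Hypothetical toy database: transaction id -> {item: purchased quantity}.
TOY_DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"A": 2, "C": 3},
    "T3": {"B": 1, "C": 2, "D": 1},
    "T4": {"A": 1, "D": 4},
}


def bounds(itemset, db):
    """Return (CF_FSM, CF_SuFSM, CF_ShFSM) for `itemset`."""
    k = len(itemset)
    ml = max(len(t) for t in db.values())                     # ML
    mv_max = max(q for t in db.values() for q in t.values())  # MV
    containing = [t for t in db.values() if all(i in t for i in itemset)]
    lmv_x = sum(t[i] for t in containing for i in itemset)
    larger = [t for t in containing if len(t) > k]            # dbS(X^{k+1})
    cf_fsm = lmv_x + (lmv_x / k) * mv_max * (ml - k)
    cf_sufsm = lmv_x + len(larger) * mv_max * (ml - k)
    cf_shfsm = sum(sum(t.values()) for t in larger)           # Tmv(dbS(X^{k+1}))
    return cf_fsm, cf_sufsm, cf_shfsm


items = sorted({i for t in TOY_DB.values() for i in t})
all_bounds = {
    frozenset(c): bounds(c, TOY_DB)
    for k in (1, 2, 3)
    for c in combinations(items, k)
}
print(bounds({"A", "C"}, TOY_DB))
```

The ShFSM value needs no MV or ML at all: it sums the actual transaction measure values, which is why it can be much smaller than the other two estimates.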
18
FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
ShFSM: Tmv(dbS(X^{k+1})) < min_lmv
Ex. X = {BCD}
  CF_FSM(X) = 9 + (9/3) × 3 × (6 - 3) = 36
  CF_SuFSM(X) = 9 + 2 × 3 × (6 - 3) = 18
  CF_ShFSM(X) = 6 + 8 = 14
19
ShFSM (Share-counted FSM)
A:10 B:8 C:10 D:6 E:4 H:1...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CE:10 CD:8 DE:0
ACE:16 BCD:15 CDE:0
Ex. X = {AB}: Tmv(dbS(X^{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv, so all supersets of {AB} are pruned
20
Experimental Results (1/3)
PC: Pentium IV 1.5 GHz, 1.5 GB SDRAM, running Windows XP Professional
All algorithms were coded in VC++ 6.0
Figure 1: running time (sec, logarithmic axis) versus minShare (%) on T4.I2.D100k.N50.S10, comparing ZSP, EZSP, FSM, EFSM, SuFSM and ShFSM
Figure 2: running time (sec, logarithmic axis) versus minShare (%) on T6.I4.D100k.N200.S10, comparing FSM, EFSM, SuFSM and ShFSM
21
Experimental Results (2/3)
Figure 3: running time (sec, logarithmic axis) versus number of transactions (k) on T6.I4.Dz.N200.S10 with minShare = 0.1%, comparing FSM, EFSM, SuFSM and ShFSM
Figure 4: running time (sec, logarithmic axis) versus minShare (%) on T10.I6.D100k.N500.S20, comparing FSM, EFSM, SuFSM and ShFSM
22
Experimental Results (3/3)
Dataset: T6.I4.D100k.N200.S10
minShare = 0.1%, ML = 20, MV = 10, Tmv = 2,302,443

Pass (k)   Set   FSM        EFSM       SuFSM     ShFSM     Fk
k=1        Ck    200        200        200       200       159
           RCk   200        200        199       197
k=2        Ck    19900      19900      19701     19306     1844
           RCk   16214      16214      13312     7199
k=3        Ck    829547     829547     564324    190607    101
           RCk   251877     251877     99765     9792
k=4        Ck    3290296    3290296    793042    20913     0
           RCk   332877     332877     41057     1420
k=5        Ck    393833     393833     25003     1050      5
           RCk   71420      71420      19720     959
k=6        Ck    26137      26137      11582     518       8
           RCk   25562      25562      11045     506
k=7        Ck    11141      11141      5940      204       7
           RCk   11099      11099      5827      196
k=8        Ck    4426       4426       2797      58        1
           RCk   4423       4423       2750      54
k>=9       Ck    2036       2036       1567      12        0
           RCk   2030       2030       1513      10
Time (sec)       13610.4    71.55      29.67     10.95
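To put the running times of this table into perspective, the following Python snippet derives the speedup of each method relative to FSM. The times are copied from the table; the ratios are simple arithmetic and are not figures stated in the talk.

```python
# Running times in seconds, as reported in the table above.
times = {"FSM": 13610.4, "EFSM": 71.55, "SuFSM": 29.67, "ShFSM": 10.95}

# Speedup of each method relative to plain FSM.
speedup = {name: times["FSM"] / t for name, t in times.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

Since FSM and EFSM examine the same numbers of candidates in every pass, the gap between them comes from the cheaper join step alone, while the further gains of SuFSM and ShFSM go along with their much smaller candidate sets.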
23
Conclusions
This study proposes the Enhanced FSM (EFSM) algorithm to efficiently reduce the time complexity of the join step
We have also developed SuFSM and ShFSM from EFSM
SuFSM and ShFSM can efficiently prune the candidates, and significantly improve the performance
The experimental results have indicated that ShFSM has the best performance
In the future, we plan to develop even more advanced algorithms to accelerate the process of identifying all share-frequent itemsets
24
Thank You