Frequent Item Mining - Kent State University jin/DM08/FIM.pdf
Frequent Item Mining
What is data mining?
• = Pattern Mining?
• What patterns?
• Why are they useful?
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
Frequent Itemsets Mining
TID     Transactions
100     { A, B, E }
200     { B, D }
300     { A, B, E }
400     { A, C }
500     { B, C }
600     { A, C }
700     { A, B }
800     { A, B, C, E }
900     { A, B, C }
1000    { A, C, E }
• Minimum support level 50%
  – {A}, {B}, {C}, {A,B}, {A,C}
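The support definitions can be checked against the ten transactions in the table above. The following is a minimal sketch; the names `transactions`, `support_count`, and `support` are illustrative and not taken from the slides.

```python
# Transactions copied from the table above (TID -> set of items).
transactions = {
    100: {"A", "B", "E"},
    200: {"B", "D"},
    300: {"A", "B", "E"},
    400: {"A", "C"},
    500: {"B", "C"},
    600: {"A", "C"},
    700: {"A", "B"},
    800: {"A", "B", "C", "E"},
    900: {"A", "B", "C"},
    1000: {"A", "C", "E"},
}

def support_count(itemset, db):
    """sigma(itemset): number of transactions that contain every item."""
    itemset = set(itemset)
    return sum(1 for items in db.values() if itemset <= items)

def support(itemset, db):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, db) / len(db)

minsup = 0.5  # the 50% threshold used on the slide
for candidate in [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"E"}]:
    s = support(candidate, transactions)
    label = "frequent" if s >= minsup else "infrequent"
    print(sorted(candidate), support_count(candidate, transactions), s, label)
```

Under these definitions {A,B} and {A,C} each appear in 5 of 10 transactions and just meet the threshold, while {B,C} (3 of 10) and {E} (4 of 10) do not.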
Frequent Pattern Mining
[Figure: example graphs with nodes labeled A to F]
Beyond Itemsets
• Sequence Mining
  – Finding frequent subsequences from a collection of sequences
• Graph Mining
  – Finding frequent (connected) subgraphs from a collection of graphs
• Tree Mining
  – Finding frequent (embedded) subtrees from a set of trees/graphs
• Geometric Structure Mining
  – Finding frequent substructures from 3-D or 2-D geometric graphs
• Among others…
Why Frequent Pattern Mining is So Important?
• Application Domains
  – Business, biology, chemistry, WWW, computer/networking security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
  – Association rule mining
  – Classification
  – Clustering
  – Change Detection
  – etc…
Network motifs: recurring patterns that occur significantly more than in randomized nets
• Do motifs have specific roles in the network?
• Many possible distinct subgraphs
  – The 13 three-node connected subgraphs
  – 199 4-node directed connected subgraphs
  – And it grows fast for larger subgraphs: 9364 5-node subgraphs, 1,530,843 6-node…
Finding network motifs – an overview
• Generation of a suitable random ensemble (reference networks)
• Network motifs detection process:
  – Count how many times each subgraph appears
  – Compute statistical significance for each subgraph: the probability of appearing in random networks as often as in the real network (P-val or Z-score)
Example: Real = 5, Rand = 0.5 ± 0.6, Z-score (# standard deviations) = 7.5
[Figure: ensemble of networks]
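The Z-score in the example follows from the real count and the mean and standard deviation over the random ensemble. A minimal sketch, assuming the usual definition (real count minus random mean, divided by random standard deviation); the function name `z_score` is illustrative:

```python
def z_score(real_count, random_mean, random_std):
    """Number of standard deviations separating the real count from the random mean."""
    return (real_count - random_mean) / random_std

# Values from the slide: Real = 5, Rand = 0.5 +/- 0.6
z = z_score(5, 0.5, 0.6)
print(round(z, 2))
```

With the slide's numbers this reproduces the stated value of 7.5.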
Three Different Views of FIM
• Transactional Database
  – How do we store a transactional database?
    • Horizontal, Vertical, Transaction-Item Pair
• Binary Matrix
• Bipartite Graph
• How is FIM formulated in these different settings?
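The views listed above can be derived from one another. The sketch below converts the first four transactions of the earlier example table between them; all variable and function names are illustrative assumptions. In the matrix view, the support count of an itemset is the number of rows that have a 1 in every chosen column; in the vertical view it is the size of the intersection of the items' tid sets.

```python
# First four transactions of the earlier example table (horizontal layout: TID -> items).
horizontal = {
    100: {"A", "B", "E"},
    200: {"B", "D"},
    300: {"A", "B", "E"},
    400: {"A", "C"},
}
items = sorted(set().union(*horizontal.values()))
tids = sorted(horizontal)

# Transaction-item pairs; these are also the edges of the bipartite graph
# whose two node sets are the transactions and the items.
pairs = sorted((tid, item) for tid, its in horizontal.items() for item in its)

# Vertical layout: item -> set of transaction ids containing it.
vertical = {item: {tid for tid in tids if item in horizontal[tid]} for item in items}

# Binary matrix: one row per transaction, one column per item.
matrix = [[1 if item in horizontal[tid] else 0 for item in items] for tid in tids]

def support_count_matrix(itemset):
    """Rows whose entries are 1 in every column of the itemset."""
    cols = [items.index(i) for i in itemset]
    return sum(all(row[c] for c in cols) for row in matrix)

def support_count_vertical(itemset):
    """Size of the intersection of the items' tid sets."""
    return len(set.intersection(*(vertical[i] for i in itemset)))

print(items)
for tid, row in zip(tids, matrix):
    print(tid, row)
print(support_count_matrix(["A", "B"]), support_count_vertical(["A", "B"]))
```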
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) for N transactions, M candidates, and maximum transaction width w => expensive since M = 2^d !!!
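The brute-force approach can be sketched directly on the ten-transaction example from earlier in the deck. This illustration enumerates the 2^d - 1 non-empty itemsets of the lattice and matches every transaction against every candidate; the names `db`, `candidates`, and `min_count` are assumptions for the sketch.

```python
from itertools import combinations

# The ten example transactions from the earlier table.
db = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"},
]
items = sorted(set().union(*db))
d = len(items)

# Every non-empty itemset in the lattice is a candidate: M = 2^d - 1.
candidates = [frozenset(c) for k in range(1, d + 1) for c in combinations(items, k)]

counts = {c: 0 for c in candidates}
for transaction in db:            # N transactions
    for c in candidates:          # M candidates
        if c <= transaction:      # subset test, cost bounded by the transaction width w
            counts[c] += 1

min_count = 5  # 50% of 10 transactions
frequent = sorted(
    (sorted(c) for c, n in counts.items() if n >= min_count),
    key=lambda c: (len(c), c),
)
print(len(candidates), frequent)
```

Even with only d = 5 items there are 31 candidates to test against each transaction, and the count doubles with every additional item.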
Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:
  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
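The anti-monotone property says that whenever X is a subset of Y, s(X) >= s(Y). As a sanity check rather than a proof, the sketch below tests every pair of itemsets on the ten-transaction example from earlier in the deck and collects any violations; the names used are illustrative.

```python
from itertools import combinations

# The ten example transactions from the earlier table.
db = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"},
]
items = sorted(set().union(*db))

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= t) / len(db)

all_itemsets = [c for k in range(1, len(items) + 1) for c in combinations(items, k)]

# A violation would be a proper subset x of y with strictly smaller support than y.
violations = [
    (x, y)
    for x in all_itemsets
    for y in all_itemsets
    if set(x) < set(y) and support(x) < support(y)
]
print(violations)
```

The list stays empty because every transaction that contains Y also contains each subset of Y.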
Illustrating Apriori Principle
[Figure: itemset lattice showing an itemset found to be infrequent and its pruned supersets]
Illustrating Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Triplets (3-itemsets)
Minimum Support = 3
If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning, 6 + 6 + 1 = 13
Apriori
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994
How to Generate Candidates?
• Suppose the items in L_{k-1} are listed in an order
• Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
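The two steps above can be sketched in Python as follows. The function names `apriori_gen` and `apriori`, the level-wise driver loop, and the sample input `L3` are illustrative assumptions and are not taken from the slides; itemsets are represented as sorted tuples so that "listed in an order" holds.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Build C_k from L_{k-1}: self-join on the first k-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Step 1 (join): same first k-2 items, p's last item < q's last item.
            if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2 (prune): every (k-1)-subset of c must be in L_{k-1}.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

def apriori(db, min_count):
    """Level-wise search: count candidates, keep the frequent ones, generate the next level."""
    def count(c):
        return sum(1 for t in db if set(c) <= t)
    items = sorted(set().union(*db))
    level = [(i,) for i in items if count((i,)) >= min_count]
    frequent = {}
    k = 1
    while level:
        frequent.update({c: count(c) for c in level})
        k += 1
        level = [c for c in apriori_gen(level, k) if count(c) >= min_count]
    return frequent

# Hypothetical L_3: abcd survives; acde is joined from acd and ace but pruned (ade is missing).
L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(apriori_gen(L3, 4))

# The ten example transactions from the earlier table, minimum support count 5 (50%).
db = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"},
]
print(apriori(db, 5))
```

On the example database this returns the same five frequent itemsets listed earlier for the 50% threshold.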
Challenges of Frequent Itemset Mining
• Challenges
  – Multiple scans of transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates
• Improving Apriori: general ideas
  – Reduce passes of transaction database scans
  – Shrink number of candidates
  – Facilitate support counting of candidates
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as their supersets
• Number of frequent itemsets
• Need a compact representation
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice with the border separating the maximal itemsets from the infrequent itemsets]
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset
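Both compact representations can be computed by checking immediate supersets, exactly as the definitions state. The sketch below uses the ten-transaction example from earlier in the deck with a minimum support count of 3; this threshold is an assumption chosen so that the closed and maximal sets differ, and all names are illustrative.

```python
from itertools import combinations

# The ten example transactions from the earlier table.
db = [
    {"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"},
]
items = sorted(set().union(*db))
min_count = 3  # assumed threshold for this illustration

def count(itemset):
    return sum(1 for t in db if itemset <= t)

# All frequent itemsets, found by exhaustive enumeration (fine for 5 items).
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        n = count(frozenset(c))
        if n >= min_count:
            frequent[frozenset(c)] = n

def immediate_supersets(itemset):
    """Itemsets obtained by adding exactly one more item."""
    return [itemset | {i} for i in items if i not in itemset]

# Maximal frequent: no immediate superset is frequent.
maximal = {s for s in frequent if not any(sup in frequent for sup in immediate_supersets(s))}

# Closed frequent: no immediate superset has the same support count.
closed = {s for s, n in frequent.items() if all(count(sup) != n for sup in immediate_supersets(s))}

print(sorted("".join(sorted(s)) for s in maximal))
print(sorted("".join(sorted(s)) for s in closed))
```

Here {E} is not closed because {A,E} has the same count of 4, and {B,E} is not closed because {A,B,E} has the same count of 3. Every maximal frequent itemset is also closed, so the maximal set is the smaller of the two.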
Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with transaction ids; some itemsets are not supported by any transactions]
Maximal vs Closed Frequent Itemsets
Minimum support = 2
# Closed = 9
# Maximal = 4
[Figure: lattice highlighting itemsets that are closed and maximal, and itemsets that are closed but not maximal]
Maximal vs Closed Itemsets
Research Questions
• How to efficiently enumerate Maximal Frequent Itemsets?
• How about Closed Frequent Itemsets?
Alternative Methods for Frequent Itemset Generation
• Representation of Database
  – horizontal vs vertical data layout
ECLAT
• For each item, store a list of transaction ids (tids)
[Figure: TID-lists]
ECLAT
• Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
• 3 traversal approaches:
  – top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
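The tid-list intersection can be sketched as a depth-first search over the vertical layout. This is a simplified illustration rather than the full ECLAT algorithm: it uses plain Python sets as tid-lists, a single traversal order, and the ten-transaction example from earlier in the deck with an assumed minimum support count of 3. The function name `eclat` and its parameters are illustrative.

```python
def eclat(prefix, tidlists, min_count, out):
    """Depth-first search; tidlists holds (item, tids) pairs that extend `prefix`."""
    for i, (item, tids) in enumerate(tidlists):
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support count is the length of the tid-list
        # Extend with each later item by intersecting the two tid-lists;
        # both belong to itemsets that share the same prefix.
        extensions = []
        for other, other_tids in tidlists[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_count:
                extensions.append((other, common))
        eclat(itemset, extensions, min_count, out)

# The ten example transactions from the earlier table (horizontal layout).
db = {
    100: {"A", "B", "E"}, 200: {"B", "D"}, 300: {"A", "B", "E"}, 400: {"A", "C"},
    500: {"B", "C"}, 600: {"A", "C"}, 700: {"A", "B"}, 800: {"A", "B", "C", "E"},
    900: {"A", "B", "C"}, 1000: {"A", "C", "E"},
}

# Convert to the vertical layout: item -> set of tids.
vertical = {}
for tid, its in db.items():
    for item in its:
        vertical.setdefault(item, set()).add(tid)

min_count = 3  # assumed threshold for this illustration
start = sorted(
    ((item, tids) for item, tids in vertical.items() if len(tids) >= min_count),
    key=lambda pair: pair[0],
)
frequent = {}
eclat((), start, min_count, frequent)
print(frequent)
```

No pass over the original transactions is needed after the vertical layout is built; the cost moves into storing the intermediate tid sets, which is the memory concern noted above.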
FP-growth Algorithm
• Use a compressed representation of the database using an FP-tree
• Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
FP-tree construction
After reading TID=1:
null
  A:1
    B:1
After reading TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1
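The insertion step can be sketched as follows. The transactions {A, B} for TID=1 and {B, C, D} for TID=2 are assumptions inferred from the node labels in the two trees, and the class `FPNode` and the functions `insert` and `render` are illustrative names. Items are inserted in alphabetical order here; FP-tree construction in general fixes one global item order, typically by decreasing support.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def insert(root, transaction, header):
    """Walk down from the root, sharing existing prefixes and incrementing counts."""
    node = root
    for item in transaction:  # items must already follow the fixed global order
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            # Header table: links all nodes that carry the same item.
            header.setdefault(item, []).append(child)
        child.count += 1
        node = child

def render(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    label = "null" if node.item is None else f"{node.item}:{node.count}"
    lines = ["  " * depth + label]
    for child in node.children.values():
        lines.extend(render(child, depth + 1))
    return lines

root = FPNode(None)
header = {}
insert(root, ["A", "B"], header)        # assumed TID=1
after_tid1 = render(root)
insert(root, ["B", "C", "D"], header)   # assumed TID=2
after_tid2 = render(root)
print("\n".join(after_tid2))
```

The second transaction shares no prefix with the first, so it starts a new branch under the root, and the header table then links two separate B nodes.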
FP-Tree Construction
[Figure: transaction database, header table, and the resulting FP-tree rooted at null, with node counts such as A:7, B:5, B:3, C:3]
Pointers are used to assist frequent itemset generation
FP-growth
[Figure: the prefix paths of the FP-tree that end in D]
Conditional pattern base for D:
  P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
Recursively apply FP-growth on P
Frequent itemsets found (with sup > 1): AD, BD, CD, ABD, ACD, BCD
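The recursion on P can be sketched without building the conditional FP-tree explicitly. This is a simplified illustration: the function `mine` recurses on conditional pattern bases held as plain lists of (prefix path, count), and all names are assumptions. Counting directly over P gives AD, BD, CD, ABD, ACD, and BCD for sup > 1; note that ABD reaches a count of 2 through the paths (A:1,B:1,C:1) and (A:1,B:1).

```python
from collections import Counter

def mine(pattern_base, suffix, min_count, out):
    """pattern_base: list of (prefix_path, count); all paths share one item order."""
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    for item, n in counts.items():
        if n < min_count:
            continue
        itemset = (item,) + suffix
        out[itemset] = n
        # Conditional pattern base for `item`: the part of each path that precedes it.
        sub_base = [
            (path[:path.index(item)], c) for path, c in pattern_base if item in path
        ]
        mine([(p, c) for p, c in sub_base if p], itemset, min_count, out)

# Conditional pattern base for D, as given above.
P = [
    (("A", "B", "C"), 1),
    (("A", "B"), 1),
    (("A", "C"), 1),
    (("A",), 1),
    (("B", "C"), 1),
]
found = {}
mine(P, ("D",), 2, found)  # sup > 1 means a count of at least 2
print(found)
```

Each itemset is generated once because an item is only ever combined with items that precede it on a path.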