6asso[3]

Data Mining: Concepts and Techniques Chapter 6

Data Mining: Concepts and Techniques

Chapter 6: Mining Association Rules in Large DatabasesAssociation rule miningMining single-dimensional Boolean association rules from transactional databasesMining multilevel association rules from transactional databasesMining multidimensional association rules from transactional databases and data warehouseFrom association mining to correlation analysisConstraint-based association miningSummary


What Is Association Mining?Association rule mining:Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.Applications:Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.Examples. Rule form: Body Head [support, confidence].buys(x, diapers) buys(x, beers) [0.5%, 60%]major(x, CS) ^ takes(x, DB) grade(x, A) [1%, 75%]


Association Rule: Basic ConceptsGiven: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)Find: all rules that correlate the presence of one set of items with that of another set of itemsE.g., 98% of people who purchase tires and auto accessories also get automotive services doneApplications* Maintenance Agreement (What the store should do to boost Maintenance Agreement sales)Home Electronics * (What other products should the store stocks up?)Attached mailing in direct marketingDetecting ping-ponging of patients, faulty collisions


Rule Measures: Support and ConfidenceFind all the rules X & Y Z with minimum confidence and supportsupport, s, probability that a transaction contains {X Y Z}confidence, c, conditional probability that a transaction having {X Y} also contains ZLet minimum support 50%, and minimum confidence 50%, we haveA C (50%, 66.6%)C A (50%, 100%)Customerbuys diaperCustomerbuys bothCustomerbuys beer


Sheet1

Transaction IDItems Bought

2000A,B,C

1000A,C

4000A,D

5000B,E,F

Association Rule Mining: A Road MapBoolean vs. quantitative associations (Based on the types of values handled)buys(x, SQLServer) ^ buys(x, DMBook) buys(x, DBMiner) [0.2%, 60%]age(x, 30..39) ^ income(x, 42..48K) buys(x, PC) [1%, 75%]Single dimension vs. multiple dimensional associations (see ex. Above)Single level vs. multiple-level analysisWhat brands of beers are associated with what brands of diapers?Various extensionsCorrelation, causality analysisAssociation does not necessarily imply correlation or causalityMaxpatterns and closed itemsetsConstraints enforcedE.g., small sales (sum < 100) trigger big buys (sum > 1,000)?


Mining Association RulesAn ExampleFor rule A C:support = support({A C}) = 50%confidence = support({A C})/support({A}) = 66.6%The Apriori principle:Any subset of a frequent itemset must be frequentMin. support 50%Min. confidence 50%


Sheet1

Transaction IDItems Bought

2000A,B,C

1000A,C

4000A,D

5000B,E,F

Sheet1

Frequent ItemsetSupport

{A}75%

{B}50%

{C}50%

{A,C}50%

Mining Frequent Itemsets: the Key StepFind the frequent itemsets: the sets of items that have minimum supportA subset of a frequent itemset must also be a frequent itemseti.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemsetIteratively find frequent itemsets with cardinality from 1 to k (k-itemset)Use the frequent itemsets to generate association rules.


The Apriori AlgorithmJoin Step: Ck is generated by joining Lk-1with itselfPrune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemsetPseudo-code:Ck: Candidate itemset of size kLk : frequent itemset of size k

L1 = {frequent items};for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support endreturn k Lk;


The Apriori Algorithm ExampleDatabase DScan DC1L1L2C2C2Scan DC3L3Scan D


Sheet1

TIDItems

1001 3 4

2002 3 5

3001 2 3 5

4002 5

Sheet1

itemsetsup.

{1}2

{2}3

{3}3

{4}1

{5}3

Sheet1

itemsetsup.

{1}2

{2}3

{3}3

{5}3

Sheet1

itemset

{1 2}

{1 3}

{1 5}

{2 3}

{2 5}

{3 5}

Sheet1

itemsetsup

{1 2}1

{1 3}2

{1 5}1

{2 3}2

{2 5}3

{3 5}2

Sheet1

itemsetsup

{1 3}2

{2 3}2

{2 5}3

{3 5}2

Sheet1

itemset

{2 3 5}

Sheet1

itemsetsup

{2 3 5}2

How to Generate Candidates?Suppose the items in Lk-1 are listed in an orderStep 1: self-joining Lk-1 insert into Ckselect p.item1, p.item2, , p.itemk-1, q.itemk-1from Lk-1 p, Lk-1 qwhere p.item1=q.item1, , p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1Step 2: pruningforall itemsets c in Ck doforall (k-1)-subsets s of c doif (s is not in Lk-1) then delete c from Ck


How to Count Supports of Candidates?Why counting supports of candidates a problem?The total number of candidates can be very huge One transaction may contain many candidatesMethod:Candidate itemsets are stored in a hash-treeLeaf node of hash-tree contains a list of itemsets and countsInterior node contains a hash tableSubset function: finds all the candidates contained in a transaction


Example of Generating CandidatesL3={abc, abd, acd, ace, bcd}Self-joining: L3*L3abcd from abc and abdacde from acd and acePruning:acde is removed because ade is not in L3C4={abcd}


Methods to Improve Aprioris EfficiencyHash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequentTransaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scansPartitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DBSampling: mining on a subset of given data, lower support threshold + a method to determine the completenessDynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent


Is Apriori Fast Enough? Performance BottlenecksThe core of the Apriori algorithm:Use frequent (k 1)-itemsets to generate candidate frequent k-itemsetsUse database scan and pattern matching to collect counts for the candidate itemsetsThe bottleneck of Apriori: candidate generationHuge candidate sets:104 frequent 1-itemset will generate 107 candidate 2-itemsetsTo discover a frequent pattern of size 100, e.g., {a1, a2, , a100}, one needs to generate 2100 1030 candidates.Multiple scans of database: Needs (n +1 ) scans, n is the length of the longest pattern


Mining Frequent Patterns Without Candidate GenerationCompress a large database into a compact, Frequent-Pattern tree (FP-tree) structurehighly condensed, but complete for frequent pattern miningavoid costly database scansDevelop an efficient, FP-tree-based frequent pattern mining methodA divide-and-conquer methodology: decompose mining tasks into smaller onesAvoid candidate generation: sub-database test only!


Construct FP-tree from a Transaction DBmin_support = 0.5TIDItems bought (ordered) frequent items100{f, a, c, d, g, i, m, p}{f, c, a, m, p}200{a, b, c, f, l, m, o}{f, c, a, b, m}300 {b, f, h, j, o}{f, b}400 {b, c, k, s, p}{c, b, p}500 {a, f, c, e, l, p, m, n}{f, c, a, m, p}Steps:Scan DB once, find frequent 1-itemset (single item pattern)Order frequent items in frequency descending orderScan DB again, construct FP-tree


Benefits of the FP-tree StructureCompleteness: never breaks a long pattern of any transactionpreserves complete information for frequent pattern miningCompactnessreduce irrelevant informationinfrequent items are gonefrequency descending ordering: more frequent items are more likely to be sharednever be larger than the original database (if not count node-links and counts)Example: For Connect-4 DB, compression ratio could be over 100


Mining Frequent Patterns Using FP-treeGeneral idea (divide-and-conquer)Recursively grow frequent pattern path using the FP-treeMethod For each item, construct its conditional pattern-base, and then its conditional FP-treeRepeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)


Major Steps to Mine FP-treeConstruct conditional pattern base for each node in the FP-treeConstruct conditional FP-tree from each conditional pattern-baseRecursively mine conditional FP-trees and grow frequent patterns obtained so farIf the conditional FP-tree contains a single path, simply enumerate all the patterns


Step 1: From FP-tree to Conditional Pattern BaseStarting at the frequent header table in the FP-treeTraverse the FP-tree by following the link of each frequent itemAccumulate all of transformed prefix paths of that item to form a conditional pattern baseConditional pattern basesitemcond. pattern basecf:3afc:3bfca:1, f:1, c:1mfca:2, fcab:1pfcam:2, cb:1


Properties of FP-tree for Conditional Pattern Base ConstructionNode-link propertyFor any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree headerPrefix path propertyTo calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P need to be accumulated, and its frequency count should carry the same count as node ai.


Step 2: Construct Conditional FP-tree For each pattern-baseAccumulate the count for each item in the baseConstruct the FP-tree for the frequent items of the pattern basem-conditional pattern base:fca:2, fcab:1All frequent patterns concerning mm, fm, cm, am, fcm, fam, cam, fcam{}f:4c:1b:1p:1b:1c:3a:3b:1m:2p:2m:1Header TableItem frequency head f4c4a3b3m3p3


Mining Frequent Patterns by Creating Conditional Pattern-Bases


Step 3: Recursively mine the conditional FP-treeCond. pattern base of am: (fc:3)Cond. pattern base of cm: (f:3){}f:3cm-conditional FP-treeCond. pattern base of cam: (f:3){}f:3cam-conditional FP-tree


Single FP-tree Path GenerationSuppose an FP-tree T has a single path PThe complete set of frequent pattern of T can be generated by enumeration of all the combinations of the sub-paths of P{}f:3c:3a:3m-conditional FP-treeAll frequent patterns concerning mm, fm, cm, am, fcm, fam, cam, fcam


Principles of Frequent Pattern GrowthPattern growth propertyLet be a frequent itemset in DB, B be 's conditional pattern base, and be an itemset in B. Then is a frequent itemset in DB iff is frequent in B. abcdef is a frequent pattern, if and only ifabcde is a frequent pattern, andf is frequent in the set of transactions containing abcde


Why Is Frequent Pattern Growth Fast?Our performance study showsFP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projectionReasoningNo candidate generation, no candidate testUse compact data structureEliminate repeated database scanBasic operation is counting and FP-tree building


FP-growth vs. Apriori: Scalability With the Support ThresholdData set T25I20D10K


Chart1

22

43

46

524

566

7500

110.3

180.2

440.1

D1 FP-growth runtime

D1 Apriori runtime

Support threshold(%)

Run time(sec.)

Sheet1

dataset T25I10D10k item=1k

FP-tree/FP-mineTreeProjectionComparison

threshold(%)#tree nodestree size(k)#stack nodesstack size(k)#trans(FP)#tree nodesmax width#transmatrices costfp/lexi treestack/widthlexi/fp trans

523034553230345533296173734312495681315.53315.531.31

4727611745727611745825771831839186973038397.60397.601.11

31461193507146119350715596639339316422148777371.80371.801.05

221896852552188515252228746732683240778730933299.14320.431.05

12684466443256093614627404457811000557415269663046.44256.092.03

0.828377668112587466210286271930019697046093535114930.51131.412.46

0.53585048604261771628334510028647561412844755892687212.5146.633.72

0.357507613802267947643153101269654131412007235728161858.2620.393.78

0.21059473254272735056564978119127070295162757934750178118.349.272.82

0.1268092164342279803671525038933391221267594461514609136977.912.211.78




21018822445101870244520173914313033161421641131712.46783.621.64

11009635242319958992390211050085256104739870832750956301192.09951.193.61

0.8137053532893134386832253146157710230193462021634034686178133.97694.864.24

0.5203932048944196776747226211608427067528211607208204318718275.34372.545.49

0.3252033060488232748155860255857272640124062057613046477459234.70187.618.04

0.130063537215825941226225929763641515433054527400206133656480819.8484.939.21

dataset T25I10D10k item=1kdataset T25I10D100k item=1k

threshold(%)FP-mem(k)FP-costTP-mem(k)TP-costTP/FP memTP/FP costthreshold(%)FP-mem(k)FP-costTP-mem(k)TP-costTP/FP memTP/FP cost

5198240020114855112110.751.2821515635076391361942986140.901.23

44520785584282420607960.622.621666291277946564469576597470.974.51

3100231793296655992441020.655.150.89429618492627969981034582331.035.59

217518331020812342253472630.707.660.5155409319159871770242373272401.147.44

122462442046018237390080730.818.820.3199892422556982753103687389961.388.73

0.823218461853819747406276690.858.800.1239233511939173518854815280571.479.41

0.525875513138325132436604110.978.51

0.333842628702933896478740991.007.61

0.255016940138252178597613610.956.36

0.1143191238758621728861420208151.215.95

threshold(%)D1 running mem. Req.D2 running mem. Req.

252522445

1614623902

0.8621032253

0.5628347226

0.3643155860

0.1671562259

Sheet1

D1 running mem. Req.



Running memory requirement (Kb)

Sheet2

T25I10D10k 1k items

threshold(%)D1 #frequent itemsetsD1 #weighted lengthD1 longestD2 #frequent itemsetsD2 #weighted lengthD2 longest

3393111911

27321.0831431.13

1.514822.18611363.27

157813.791052563.539

0.893003.6311102303.8911

0.5286474.0912270674.0912

0.3696544.4313726404.7414

0.21270704.3813997074.6614

0.13391223.84141515434.6414

Sheet2

00

00

00

00

00

00

00

00

00

D1 #frequent itemsets



Number of frequent itemsets

Sheet3

0000

0000

0000

0000

0000

0000

0000

0000

0000

D1 #weighted length

D1 longest

D2 #weighted length

D2 longest


Length of frequent itemsets

Sheet4

threshold(%)D1 FP-growth runtimeD1 Apriori runtimeD2 FP-growth runtimeD2 Apriori runtimeD1 #frequent itemsetsD2 #frequent itemsetsD1 runtime/itemsetD2 runtime/itemset

3221012393190.00508905850.5263157895

24311217321430.00546448090.0769230769

1.5461564148211360.00269905530.0132042254

152426225578152560.00086490230.0049467275

0.8566359300102300.00053763440.0034213099

0.575005828647270670.00024435370.0021428308

0.3119169654726400.00015792350.0012527533

0.218107127070997070.00014165420.0010731443

0.1441373391221515430.00012974680.0009040338

Sheet4

00

00

00

00

00

00

00

00

00


D1 Apriori runtime


Run time(sec.)

Sheet5

00

00

00

00

00

00

00

00

00


D2 Apriori runtime


Run time(sec.)

Sheet6

00

00

00

00

00

00

00

00

00

D1 runtime/itemset

D2 runtime/itemset


Runtime/itemset(sec.)

Size(K)FP-growthApriori

1014

20311

30517

50731

801251

1001564

00

00

00

00

00

00

FP-growth

Apriori

Number of transactions (K)

Run time (sec.)

D1 10k

threshold(%)D1 FP-growthD1 TreeProjection

322

244

1.545

157

0.859

0.5718

0.31131

0.21846

0.14498

D2 100k


31011

21111

1.51521

12657

0.835123

0.558300

0.391

0.2107

0.1137

scalability1.00%

sizeFP-growthTreeProjection

1033

2059

30815

501327

802045

1002657

00

00

00

00

00

00

00

00

00

D1 FP-growth

D1 TreeProjection

Support threshold (%)

Run time (sec.)

00

00

00

00

00

00

00

00

00

D2 FP-growth

D2 TreeProjection


Runtime (sec.)

00

00

00

00

00

00

FP-growth

TreeProjection


Run time (sec.)

Support threshold(%)#nodes in FP-tree#frequent item occurrencesRatioFP-tree MemDB MemMem ratio

9036313757483789.9487125502992631.6565656566

80151517787821174.11363607115128195.6855885589

7026151930953738.41627607723812123.0690248566

6081052147771264.99194520859108444.165556241

50134492219609165.04322776887843627.5064936674

4037162231038762.17891888924154810.3617808514

30123368242560519.66296083297024203.2769235134

2037387326563797.118972952106255161.1841717196

1080747327925353.4619379352111701400.5763938856

00

00

00

00

00

00

00

00

00

#nodes in FP-tree

#frequent item occurrences

FP-growth vs. Tree-Projection: Scalability with Support ThresholdData set T25I20D100K


Chart3

1011

1111

1521

2657

35123

58300

910.3

1070.2

1370.1

D2 FP-growth

D2 TreeProjection


Runtime (sec.)

Sheet1




523034553230345533296173734312495681315.53315.531.31

4727611745727611745825771831839186973038397.60397.601.11

31461193507146119350715596639339316422148777371.80371.801.05

221896852552188515252228746732683240778730933299.14320.431.05

12684466443256093614627404457811000557415269663046.44256.092.03

0.828377668112587466210286271930019697046093535114930.51131.412.46

0.53585048604261771628334510028647561412844755892687212.5146.633.72

0.357507613802267947643153101269654131412007235728161858.2620.393.78

0.21059473254272735056564978119127070295162757934750178118.349.272.82

0.1268092164342279803671525038933391221267594461514609136977.912.211.78




21018822445101870244520173914313033161421641131712.46783.621.64

11009635242319958992390211050085256104739870832750956301192.09951.193.61

0.8137053532893134386832253146157710230193462021634034686178133.97694.864.24

0.5203932048944196776747226211608427067528211607208204318718275.34372.545.49

0.3252033060488232748155860255857272640124062057613046477459234.70187.618.04

0.130063537215825941226225929763641515433054527400206133656480819.8484.939.21

dataset T25I10D10k item=1kdataset T25I10D100k item=1k

threshold(%)FP-mem(k)FP-costTP-mem(k)TP-costTP/FP memTP/FP costthreshold(%)FP-mem(k)FP-costTP-mem(k)TP-costTP/FP memTP/FP cost

5198240020114855112110.751.2821515635076391361942986140.901.23

44520785584282420607960.622.621666291277946564469576597470.974.51

3100231793296655992441020.655.150.89429618492627969981034582331.035.59

217518331020812342253472630.707.660.5155409319159871770242373272401.147.44

122462442046018237390080730.818.820.3199892422556982753103687389961.388.73

0.823218461853819747406276690.858.800.1239233511939173518854815280571.479.41

0.525875513138325132436604110.978.51

0.333842628702933896478740991.007.61

0.255016940138252178597613610.956.36

0.1143191238758621728861420208151.215.95

threshold(%)D1 running mem. Req.D2 running mem. Req.

252522445

1614623902

0.8621032253

0.5628347226

0.3643155860

0.1671562259

Sheet1

00

00

00

00

00

00




Running memory requirement (Kb)

Sheet2

T25I10D10k 1k items

threshold(%)D1 #frequent itemsetsD1 #weighted lengthD1 longestD2 #frequent itemsetsD2 #weighted lengthD2 longest

3393111911

27321.0831431.13

1.514822.18611363.27

157813.791052563.539

0.893003.6311102303.8911

0.5286474.0912270674.0912

0.3696544.4313726404.7414

0.21270704.3813997074.6614

0.13391223.84141515434.6414

Sheet2

00

00

00

00

00

00

00

00

00




Number of frequent itemsets

Sheet3

0000

0000

0000

0000

0000

0000

0000

0000

0000

D1 #weighted length

D1 longest

D2 #weighted length

D2 longest


Length of frequent itemsets

Sheet4

threshold(%)D1 FP-growth runtimeD1 Apriori runtimeD2 FP-growth runtimeD2 Apriori runtimeD1 #frequent itemsetsD2 #frequent itemsetsD1 runtime/itemsetD2 runtime/itemset

3221012393190.00508905850.5263157895

24311217321430.00546448090.0769230769

1.5461564148211360.00269905530.0132042254

152426225578152560.00086490230.0049467275

0.8566359300102300.00053763440.0034213099

0.575005828647270670.00024435370.0021428308

0.3119169654726400.00015792350.0012527533

0.218107127070997070.00014165420.0010731443

0.1441373391221515430.00012974680.0009040338

Sheet4

00

00

00

00

00

00

00

00

00


D1 Apriori runtime


Run time(sec.)

Sheet5

00

00

00

00

00

00

00

00

00


D2 Apriori runtime


Run time(sec.)

Sheet6

00

00

00

00

00

00

00

00

00

D1 runtime/itemset

D2 runtime/itemset


Runtime/itemset(sec.)

Size(K)FP-growthApriori

1014

20311

30517

50731

801251

1001564

00

00

00

00

00

00

FP-growth

Apriori


Run time (sec.)

D1 10k


322

244

1.545

157

0.859

0.5718

0.31131

0.21846

0.14498

D2 100k


31011

21111

1.51521

12657

0.835123

0.558300

0.391

0.2107

0.1137

scalability1.00%

sizeFP-growthTreeProjection

1033

2059

30815

501327

802045

1002657

00

00

00

00

00

00

00

00

00

D1 FP-growth

D1 TreeProjection


Run time (sec.)

00

00

00

00

00

00

00

00

00

D2 FP-growth

D2 TreeProjection


Runtime (sec.)

00

00

00

00

00

00

FP-growth

TreeProjection


Run time (sec.)

Support threshold(%)#nodes in FP-tree#frequent item occurrencesRatioFP-tree MemDB MemMem ratio

9036313757483789.9487125502992631.6565656566

80151517787821174.11363607115128195.6855885589

7026151930953738.41627607723812123.0690248566

6081052147771264.99194520859108444.165556241

50134492219609165.04322776887843627.5064936674

4037162231038762.17891888924154810.3617808514

30123368242560519.66296083297024203.2769235134

2037387326563797.118972952106255161.1841717196

1080747327925353.4619379352111701400.5763938856

00

00

00

00

00

00

00

00

00

#nodes in FP-tree

#frequent item occurrences

Presentation of Association Rules (Table Form )


Visualization of Association Rule Using Plane Graph


Visualization of Association Rule Using Rule Graph


Iceberg QueriesIcerberg query: Compute aggregates over one or a set of attributes only for those whose aggregate values is above certain thresholdExample:select P.custID, P.itemID, sum(P.qty)from purchase Pgroup by P.custID, P.itemIDhaving sum(P.qty) >= 10Compute iceberg queries efficiently by Apriori:First compute lower dimensionsThen compute higher dimensions only when all the lower ones are above the threshold


Multiple-Level Association RulesItems often form hierarchy.Items at the lower level are expected to have lower support.Rules regarding itemsets at appropriate levels could be quite useful.Transaction database can be encoded based on dimensions and levelsWe can explore shared multi-level mining


Mining Multi-Level AssociationsA top_down, progressive deepening approach: First find high-level strong rules: milk bread [20%, 60%]. Then find their lower-level weaker rules: 2% milk wheat bread [6%, 50%].Variations at mining multiple-level association rules.Level-crossed association rules: 2% milk Wonder wheat breadAssociation rules with multiple, alternative hierarchies: 2% milk Wonder bread


Multi-level Association: Uniform Support vs. Reduced SupportUniform Support: the same minimum support for all levels+ One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support. Lower level items do not occur as frequently. If support threshold too high miss low level associationstoo low generate too many high level associationsReduced Support: reduced minimum support at lower levelsThere are 4 search strategies:Level-by-level independentLevel-cross filtering by k-itemsetLevel-cross filtering by single itemControlled level-cross filtering by single item


Uniform SupportMulti-level mining with uniform supportMilk[support = 10%]2% Milk [support = 6%]Skim Milk [support = 4%]Level 1min_sup = 5%Level 2min_sup = 5%Back


Reduced SupportMulti-level mining with reduced support2% Milk [support = 6%]Skim Milk [support = 4%]Level 1min_sup = 5%Level 2min_sup = 3%BackMilk[support = 10%]


Multi-level Association: Redundancy FilteringSome rules may be redundant due to ancestor relationships between items.Examplemilk wheat bread [support = 8%, confidence = 70%]2% milk wheat bread [support = 2%, confidence = 72%]We say the first rule is an ancestor of the second rule.A rule is redundant if its support is close to the expected value, based on the rules ancestor.


Multi-Level Mining: Progressive DeepeningA top-down, progressive deepening approach: First mine high-level frequent items: milk (15%), bread (10%) Then mine their lower-level weaker frequent itemsets: 2% milk (5%), wheat bread (4%)Different min_support threshold across multi-levels lead to different algorithms:If adopting the same min_support across multi-levelsthen toss t if any of ts ancestors is infrequent.If adopting reduced min_support at lower levelsthen examine only those descendents whose ancestors support is frequent/non-negligible.


Progressive Refinement of Data Mining QualityWhy progressive refinement?Mining operator can be expensive or cheap, fine or roughTrade speed with quality: step-by-step refinement.Superset coverage property: Preserve all the positive answersallow a positive false test but not a false negative test.Two- or multi-step mining:First apply rough/cheap operator (superset coverage)Then apply expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD95).


Progressive Refinement Mining of Spatial Association RulesHierarchy of spatial relationship:g_close_to: near_by, touch, intersect, contain, etc.First search for rough relationship and then refine it.Two-step mining of spatial association:Step 1: rough spatial computation (as a filter) Using MBR or R-tree for rough estimation.Step2: Detailed spatial algorithm (as refinement) Apply only to those objects which have passed the rough spatial association test (no less than min_support)


Multi-Dimensional Association: ConceptsSingle-dimensional rules:buys(X, milk) buys(X, bread)Multi-dimensional rules: 2 dimensions or predicatesInter-dimension association rules (no repeated predicates)age(X,19-25) occupation(X,student) buys(X,coke)hybrid-dimension association rules (repeated predicates)age(X,19-25) buys(X, popcorn) buys(X, coke)Categorical Attributesfinite number of possible values, no ordering among valuesQuantitative Attributesnumeric, implicit ordering among values


Techniques for Mining MD AssociationsSearch for frequent k-predicate set:Example: {age, occupation, buys} is a 3-predicate set.Techniques can be categorized by how age are treated.1. Using static discretization of quantitative attributesQuantitative attributes are statically discretized by using predefined concept hierarchies.2. Quantitative association rulesQuantitative attributes are dynamically discretized into binsbased on the distribution of the data.3. Distance-based association rulesThis is a dynamic discretization process that considers the distance between data points.


Static Discretization of Quantitative AttributesDiscretized prior to mining using concept hierarchy.Numeric values are replaced by ranges.In relational database, finding all frequent k-predicate sets will require k or k+1 table scans.Data cube is well suited for mining.The cells of an n-dimensional cuboid correspond to the predicate sets.Mining from data cubes can be much faster.


Quantitative Association Rulesage(X,30-34) income(X,24K - 48K) buys(X,high resolution TV)Numeric attributes are dynamically discretizedSuch that the confidence or compactness of the rules mined is maximized.2-D quantitative association rules: Aquan1 Aquan2 AcatCluster adjacent association rulesto form general rules using a 2-D grid.Example:


ARCS (Association Rule Clustering System)How does ARCS work?

1. Binning

2. Find frequent predicateset

3. Clustering

4. Optimize


Limitations of ARCSOnly quantitative attributes on LHS of rules.Only 2 attributes on LHS. (2D limitation)An alternative to ARCSNon-grid-basedequi-depth binningclustering based on a measure of partial completeness.Mining Quantitative Association Rules in Large Relational Tables by R. Srikant and R. Agrawal.


Mining Distance-based Association RulesBinning methods do not capture the semantics of interval data

Distance-based partitioning, more meaningful discretization considering:density/number of points in an intervalcloseness of points in an interval


Sheet1

Price($)Equi-width(width $10)Equi-depth(depth 2)Distance-based

7[0,10][7,20][7,7]

20[11,20][22,50][20,22]

22[21,30][51,53][50,53]

50[31,40]

51[41,50]

53[51,60]

S[X] is a set of N tuples t1, t2, , tN , projected on the attribute set XThe diameter of S[X]:

distx:distance metric, e.g. Euclidean distance or Manhattan

Clusters and Distance Measurements


The diameter, d, assesses the density of a cluster CX , where

Finding clusters and distance-based rulesthe density threshold, d0 , replaces the notion of supportmodified version of the BIRCH clustering algorithm

Clusters and Distance Measurements(Cont.)


Interestingness MeasurementsObjective measuresTwo popular measurements: support; and confidence

Subjective measures (Silberschatz & Tuzhilin, KDD95)A rule (pattern) is interesting ifit is unexpected (surprising to the user); and/oractionable (the user can do something with it)


Criticism to Support and ConfidenceExample 1: (Aggarwal & Yu, PODS98)Among 5000 students3000 play basketball3750 eat cereal2000 both play basket ball and eat cerealplay basketball eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75% which is higher than 66.7%.play basketball not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence


Sheet1

basketballnot basketballsum(row)

cereal200017503750

not cereal10002501250

sum(col.)300020005000

MBD00A1A217.unknown

Criticism to Support and Confidence (Cont.)Example 2:X and Y: positively correlated,X and Z, negatively relatedsupport and confidence of X=>Z dominates We need a measure of dependent or correlated events

P(B|A)/P(B) is also called the lift of rule A => B


Sheet1

X11110000

Y11000000

Z01111111

Sheet1

RuleSupportConfidence

X=>Y25%50%

X=>Z37.50%75%

Other Interestingness Measures: InterestInterest (correlation, lift)taking both P(A) and P(B) in considerationP(A^B)=P(B)*P(A), if A and B are independent eventsA and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated


Sheet1

X11110000

Y11000000

Z01111111

Sheet1

ItemsetSupportInterest

X,Y25%2

X,Z37.50%0.9

Y,Z12.50%0.57

Chapter 6: Mining Association Rules in Large DatabasesAssociation rule miningMining single-dimensional Boolean association rules from transactional databasesMining multilevel association rules from transactional databasesMining multidimensional association rules from transactional databases and data warehouseFrom association mining to correlation analysisConstraint-based association miningSummary


Constraint-Based MiningInteractive, exploratory mining giga-bytes of data? Could it be real? Making good use of constraints!What kinds of constraints can be used in mining?Knowledge type constraint: classification, association, etc.Data constraint: SQL-like queries Find product pairs sold together in Vancouver in Dec.98.Dimension/level constraints:in relevance to region, price, brand, customer category.Rule constraintssmall sales (price < $10) triggers big sales (sum > $200).Interestingness constraints:strong rules (min_support 3%, min_confidence 60%).


Rule Constraints in Association MiningTwo kind of rule constraints: Rule form constraints: meta-rule guided mining. P(x, y) ^ Q(x, w) takes(x, database systems). Rule (content) constraint: constraint-based query optimization (Ng, et al., SIGMOD98).sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 10001-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD99): 1-var: A constraint confining only one side (L/R) of the rule, e.g., as shown above. 2-var: A constraint confining both sides (L and R).sum(LHS) < min(RHS) ^ max(RHS) < 5* sum(LHS)


Constrain-Based Association QueryDatabase: (1) trans (TID, Itemset ), (2) itemInfo (Item, Type, Price)A constrained asso. query (CAQ) is in the form of {(S1, S2 )|C },where C is a set of constraints on S1, S2 including frequency constraintA classification of (single-variable) constraints:Class constraint: S A. e.g. S ItemDomain constraint:S v, { , , , , , }. e.g. S.Price < 100v S, is or . e.g. snacks S.TypeV S, or S V, { , , , , } e.g. {snacks, sodas } S.TypeAggregation constraint: agg(S) v, where agg is in {min, max, sum, count, avg}, and { , , , , , }.e.g. count(S1.Type) 1 , avg(S2.Price) 100


Constrained Association Query Optimization ProblemGiven a CAQ = { (S1, S2) | C }, the algorithm should be :sound: It only finds frequent sets that satisfy the given constraints Ccomplete: All frequent sets satisfy the given constraints C are foundA nave solution:Apply Apriori for finding all frequent sets, and then to test them for constraint satisfaction one by one.Our approach:Comprehensive analysis of the properties of constraints and try to push them as deeply as possible inside the frequent set computation.


Anti-monotone and Monotone ConstraintsA constraint Ca is anti-monotone iff. for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy CaA constraint Cm is monotone iff. for any pattern S satisfying Cm, every super-pattern of S also satisfies it


Succinct ConstraintA subset of item Is is a succinct set, if it can be expressed as p(I) for some selection predicate p, where is a selection operatorSP2I is a succinct power set, if there is a fixed number of succinct set I1, , Ik I, s.t. SP can be expressed in terms of the strict power sets of I1, , Ik using union and minusA constraint Cs is succinct provided SATCs(I) is a succinct power set


Convertible ConstraintSuppose all items in patterns are listed in a total order RA constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies CA constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C


Relationships Among Categories of ConstraintsSuccinctnessAnti-monotonicityMonotonicityConvertible constraintsInconvertible constraints


Property of Constraints: Anti-MonotoneAnti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.Examples: sum(S.Price) v is anti-monotonesum(S.Price) v is not anti-monotonesum(S.Price) = v is partly anti-monotoneApplication:Push sum(S.price) 1000 deeply into iterative frequent set computation.


Characterization of Anti-Monotonicity ConstraintsS v, { , , }v SS VS VS Vmin(S) vmin(S) vmin(S) vmax(S) vmax(S) vmax(S) vcount(S) vcount(S) vcount(S) vsum(S) vsum(S) vsum(S) vavg(S) v, { , , }(frequent constraint)yesnonoyespartlynoyespartlyyesnopartlyyesnopartlyyesnopartlyconvertible(yes)


Example of Convertible Constraints: Avg(S) VLet R be the value descending order over the set of itemsE.g. I={9, 8, 6, 4, 3, 1}Avg(S) v is convertible monotone w.r.t. RIf S is a suffix of S1, avg(S1) avg(S){8, 4, 3} is a suffix of {9, 8, 4, 3}avg({9, 8, 4, 3})=6 avg({8, 4, 3})=5If S satisfies avg(S) v, so does S1{8, 4, 3} satisfies constraint avg(S) 4, so does {9, 8, 4, 3}


Property of Constraints: SuccinctnessSuccinctness:For any set S1 and S2 satisfying C, S1 S2 satisfies CGiven A1 is the sets of size 1 satisfying C, then any set S satisfying C are based on A1 , i.e., it contains a subset belongs to A1 , Example : sum(S.Price ) v is not succinctmin(S.Price ) v is succinctOptimization:If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is not affected by the iterative support counting.


Characterization of Constraints by SuccinctnessS v, { , , }v SS VS VS Vmin(S) vmin(S) vmin(S) vmax(S) vmax(S) vmax(S) vcount(S) vcount(S) vcount(S) vsum(S) vsum(S) vsum(S) vavg(S) v, { , , }(frequent constraint)Yesyesyesyesyesyesyesyesyesyesyesweaklyweaklyweakly nononono(no)


Why Is the Big Pie Still There?More on constraint-based mining of associations Boolean vs. quantitative associationsAssociation on discrete vs. continuous dataFrom association to correlation and causal structure analysis.Association does not necessarily imply correlation or causal relationshipsFrom intra-trasanction association to inter-transaction associationsE.g., break the barriers of transactions (Lu, et al. TOIS99). From association analysis to classification and clustering analysisE.g, clustering association rules


SummaryAssociation rule mining probably the most significant contribution from the database community in KDDA large number of papers have been publishedMany interesting issues have been exploredAn interesting research directionAssociation analysis in other types of data: spatial data, multimedia data, time series data, etc.


6asso[3]

Documents

Transcript of 6asso[3]