Data Mining: Concepts and Techniques
Mining Frequent Patterns, Associations, and Correlations
(Chapter 5)
Chapter 5: Mining Frequent Patterns, Associations, and Correlations
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Rule form: "Body -> Head [support, confidence]". Examples:
- buys(x, "diapers") -> buys(x, "beers") [0.5%, 60%]
- age(X, "20..29") ^ income(X, "30..39K") -> buys(X, "PC") [2%, 60%]
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, classification, etc.
Denotation of Association Rules
Rule form: "Body -> Head [support, confidence]" (association, correlation, and causality).
Rule examples:
- sales(T, "computer") -> sales(T, "software") [support = 1%, confidence = 75%]
- buy(T, "Beer") -> buy(T, "Diaper") [support = 2%, confidence = 70%]
- age(X, "20..29") ^ income(X, "30..39K") -> buys(X, "PC") [support = 2%, confidence = 60%]
[Slide annotations ("data context", "attribute name", "data content", "data object") label the parts of these example rules.]
Association Rule Mining (Basic Concepts)
Given: a database or set of transactions; each transaction is a list of items (e.g., purchased by a customer in a visit).
Find: all rules that correlate the presence of one set of items with that of another set of items (X -> Y). E.g., 98% of people who purchase tires and auto accessories also get car maintenance done.
Application examples:
- ? -> Car Maintenance Agreement (what should the store do to boost Car Maintenance Agreement sales?)
- Home Electronics -> ? (what other products should the store stock up on?)
Association Rule Mining (Support and Confidence)
Given a transaction database, find all the rules X -> Y with minimum support and confidence:
- support S: probability that a transaction contains {X & Y}
- confidence C: conditional probability that a transaction having {X} also contains Y

Transaction ID | Items Bought
0001           | A, B, C
0002           | A, C
0003           | A, D
0004           | B, E, F

I = {i1, i2, i3, ..., in} is the set of all items; each transaction Tj ⊆ I (j = 1..n).
From this database: A -> C (50%, 66.6%); C -> A (50%, 100%). A minimal sketch of these computations follows.
[Venn diagram: customers buying X, customers buying Y, and their overlap, customers buying both.]
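To make these definitions concrete, here is a minimal Python sketch (Python is our choice of language; the slides themselves use only pseudo-code) that computes support and confidence over the four-transaction database above:

```python
# Toy transaction database from the slide.
transactions = [
    {"A", "B", "C"},   # 0001
    {"A", "C"},        # 0002
    {"A", "D"},        # 0003
    {"B", "E", "F"},   # 0004
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))       # 0.5    -> A -> C has 50% support
print(confidence({"A"}, {"C"}))  # 0.666  -> A -> C has ~66.6% confidence
print(confidence({"C"}, {"A"}))  # 1.0    -> C -> A has 100% confidence
```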
A Road Map for Association Rule Mining (Types of Association Rules)
- Boolean vs. quantitative associations (based on the types of data values):
  buys(x, "SQLServer") ^ buys(x, "DMBook") -> buys(x, "DBMiner") [0.2%, 60%]
  age(x, "30..39") ^ income(x, "42..48K") -> buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis: what brands of beers are associated with what brands of diapers?
- Various extensions: correlation and causality analysis (association does not necessarily imply correlation or causality); enforced constraints, e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
Chapter 5: Mining Association Rules in Large Databases
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
Mining Association Rules (How to Find Strong Rules: An Example)
1) Find the frequent itemsets. A frequent itemset is a set of items that has minimum support.
2) Generate association rules from the frequent itemsets. Rule form: "Body -> Head [support, confidence]" with minimum support and confidence.
Example, for rule A -> C: find the supports of {A, C} and {A}:
support = support({A, C}) = 50% > min. support
confidence = support({A, C}) / support({A}) = 66.6% > min. confidence
Problem: how many itemsets and rules are there? A -> C; ...; C -> A; ...; A, B -> C; ...

Transaction ID | Items Bought
1001           | A, B, C
1002           | A, C
1003           | A, D
1004           | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

Goal: find strong rules with support >= 50% and confidence >= 50% (see the rule-generation sketch below).
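As a sketch of step 2, the following Python snippet (our own illustration, not an algorithm from the slides) enumerates Body -> Head rules from the frequent-itemset table above and keeps the strong ones:

```python
from itertools import combinations

# Frequent itemsets and their supports, copied from the table above.
freq = {frozenset("A"): 0.75, frozenset("B"): 0.50,
        frozenset("C"): 0.50, frozenset({"A", "C"}): 0.50}

def strong_rules(freq, min_conf):
    """Split each frequent itemset of size >= 2 into Body -> Head rules."""
    out = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):   # itemsets of size 1 are skipped
            for body in map(frozenset, combinations(itemset, r)):
                conf = sup / freq[body]    # support({Body,Head}) / support({Body})
                if conf >= min_conf:
                    out.append((set(body), set(itemset - body), sup, conf))
    return out

for body, head, s, c in strong_rules(freq, min_conf=0.5):
    print(body, "->", head, f"[support={s:.0%}, confidence={c:.1%}]")
# {'A'} -> {'C'} [support=50%, confidence=66.7%]
# {'C'} -> {'A'} [support=50%, confidence=100.0%]
```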
Mining Association Rules Efficiently (Apriori Algorithm)
1. Find the frequent itemsets (key steps): iteratively find frequent itemsets with cardinality from 1 to k (from 1-itemsets to k-itemsets) using the Apriori principle.
- Frequent itemset: a set of items that has minimum support.
- Apriori principle: any subset of a frequent itemset must also be frequent. I.e., if {A, B} is a frequent 2-itemset, both {A} and {B} must be frequent 1-itemsets; if either {A} or {B} is not frequent, {A, B} cannot be frequent.
2. Generate association rules from the frequent itemsets.
The Apriori Algorithm — Example (support >= 50%, i.e., support count >= 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D -> C1:
itemset | sup.
{1}     | 2
{2}     | 3
{3}     | 3
{4}     | 1
{5}     | 3

L1 ({4} pruned):
itemset | sup.
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D -> C2 with counts:
{1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2

L2: {1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

C3 (from L2): {2 3 5}. Scan D -> {2 3 5} has count 2, so L3: {2 3 5} 2.
The Apriori Algorithm (Pseudo-Code for Finding Frequent Itemsets)
Iteratively: C1 -> L1 -> C2 -> L2 -> C3 -> L3 -> ... -> CN -> LN
Ck (like a table): the set of candidate itemsets of size k
Lk (like a table): the set of frequent itemsets of size k

Pseudo-code:
C1 = {candidate 1-itemsets};
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidate itemsets generated from Lk;
    for each transaction t in the database do
        increment the count of all itemsets in Ck+1 contained in t;
    Lk+1 = itemsets in Ck+1 with support >= min_support;
end
return ∪k Lk;

[Loop diagram: Lk -(join & prune)-> Ck+1 -(scan database)-> Lk+1]
A runnable rendering follows.
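The pseudo-code translates fairly directly into Python. The sketch below is one possible rendering (the function name and data layout are ours); on the example database of the previous slide it reproduces L1, L2, and L3:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets as {frozenset: support_count}."""
    n_needed = min_support * len(transactions)
    # L1: count single items, keep those meeting the threshold.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= n_needed}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join: union pairs of frequent k-itemsets into (k+1)-candidates;
        # Prune: keep a candidate only if all its k-subsets are frequent.
        candidates = set()
        keys = list(Lk)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # One database scan to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= n_needed}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, 0.5))  # includes frozenset({2, 3, 5}) with count 2
```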
Generate Candidate Itemsets Ck+1 from Lk (Lk -> Ck+1: Join & Prune)
Representation:
- Lk: the set of frequent itemsets of size k
- Ck+1: the set of candidate itemsets of size k+1
Steps for computing the candidate (k+1)-itemsets Ck+1:
1) Join step (by the Apriori principle): Ck+1 is generated by joining Lk with itself (note the length constraint: k -> k+1).
2) Prune step (apply the Apriori principle): any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset, so use Lk to prune Ck+1.

Example. Suppose the items in each itemset of Lk are listed (sorted) in order.
Step 1: self-join Lk (SQL pseudo-code):
insert into Ck+1
select p.item1, p.item2, ..., p.itemk, q.itemk
from Lk p, Lk q            -- p, q are aliases of Lk
where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1,
      p.itemk < q.itemk
Step 2: prune. Any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset (Ck+1 is like a table):
for every (k+1)-itemset c in Ck+1 do
    for every k-subset s of c do
        if (s is not in Lk) then delete c from Ck+1
Generation of Candidate Itemset Ck+1 (Effective pseudo-code)
[The detailed pseudo-code appears as a figure on this slide and was not transcribed.]
Generating Candidate Itemsets C4 (Example: L3 -> C4)
L3 = {abc, abd, acd, ace, bcd}. What is C4?
Self-joining L3 * L3 (pair-wise joining on the first two items):
- abc + abd -> abcd ∈ C4
- abc + ace -> abce ∉ C4 (abc and ace differ in their second item, so the join condition fails)
- acd + ace -> acde ∈ C4
Pruning: acde is removed from C4 because ade ∉ L3 (and cde ∉ L3); the 3-subsets of acde are acd, ace, ade, cde.
Therefore, C4 = {abcd}. A code rendering of this join-and-prune follows.
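A Python rendering of the join-and-prune procedure (a sketch assuming itemsets are kept as sorted tuples; the function name is ours) reproduces this L3 -> C4 example:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Join Lk with itself on the first k-1 items, then Apriori-prune."""
    Lk = sorted(Lk)                  # itemsets as sorted tuples
    k = len(Lk[0])
    Lset = set(Lk)
    Ck1 = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            p, q = Lk[i], Lk[j]
            if p[:k-1] == q[:k-1] and p[k-1] < q[k-1]:       # join step
                cand = p + (q[k-1],)
                # prune step: every k-subset must already be frequent
                if all(s in Lset for s in combinations(cand, k)):
                    Ck1.append(cand)
    return Ck1

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')]  ->  C4 = {abcd}
```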
Is Apriori Fast Enough? (Performance Bottlenecks)
The core of the Apriori algorithm:
- Compute Ck (Lk-1 -> Ck): use frequent (k-1)-itemsets to generate candidate k-itemsets (candidate generation)
- Compute Lk (Ck -> Lk): use database scans and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori is candidate generation:
- Huge candidate sets when computing Ck: 10^4 frequent 1-itemsets would generate ~10^7 candidate 2-itemsets. What would be the number of 3-itemsets, 4-itemsets, ...?
- For a transaction database over 100 items, e.g., {a1, a2, ..., a100}, there are 2^100 ≈ 10^30 possible candidate itemsets.
- Multiple scans of the database when computing Lk: N scans are needed, where N is the length of the longest pattern.
Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (enables parallel computing)
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine the completeness
Mining Frequent Patterns by FP-Growth (Without Candidate Generation and Repeated DB Scans)
- Avoid costly database scans: compress a large database into a compact Frequent-Pattern tree (FP-tree) structure, highly condensed but complete for frequent pattern mining
- Avoid costly candidate generation: an efficient FP-tree-based frequent pattern mining method using a divide-and-conquer methodology that decomposes mining tasks into smaller ones
FP-growth vs. Apriori (Scalability with the Support Threshold)
[Figure: run time (sec., 0-100) vs. support threshold (0-3%) on data set D1; the "D1 FP-growth runtime" curve stays far below the "D1 Apriori runtime" curve as the threshold drops.]
On data set T25I20D10K. For a data set named TxIyDz:
- x indicates the average transaction length
- y indicates the average itemset length
- z indicates the total number of transaction instances
Major Steps of the FP-growth Algorithm
1) Construct the FP-tree
2) Construct the conditional pattern base for each item in the header table of the FP-tree
3) Construct the conditional FP-tree from each entry of the conditional pattern base
4) Derive frequent patterns by mining the conditional FP-trees recursively; if a conditional FP-tree contains a single path, simply enumerate all the patterns
Construct the FP-tree from a Transaction DB
(min_support = 50%, i.e., support count >= 2.5 over 5 transactions, so an item must appear at least 3 times)

TID | Items bought              | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p}  | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}     | {f, c, a, b, m}
300 | {b, f, h, j, o}           | {f, b}
400 | {b, c, k, s, p}           | {c, b, p}
500 | {a, f, c, e, l, p, m, n}  | {f, c, a, m, p}

Steps:
1. Scan the DB once and find the frequent 1-itemsets
2. Order the frequent items in frequency-descending order (header table: f:4, c:4, a:3, b:3, m:3, p:3)
3. Scan the DB again and construct the FP-tree (a sketch follows)

[Resulting FP-tree, with header-table node-links into the tree:
{} - f:4 - c:3 - a:3 - m:2 - p:2
                a:3 - b:1 - m:1
     f:4 - b:1
{} - c:1 - b:1 - p:1]
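Below is a compact Python sketch of the construction (class and variable names are ours; frequency ties are broken alphabetically so that the item order comes out f, c, a, b, m, p as on the slide):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Two DB scans: count items, then insert ordered frequent items."""
    freq = defaultdict(int)
    for t in transactions:                  # scan 1: frequent 1-itemsets
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Frequency-descending order, ties broken alphabetically.
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = Node(None, None)
    header = defaultdict(list)              # item -> list of tree nodes
    for t in transactions:                  # scan 2: insert ordered paths
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)  # node-link for the header table
            node = node.children[item]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"),
      list("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}
```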
Benefits of the FP-tree Structure
- Completeness: never breaks a long pattern of any transaction; preserves complete information for frequent pattern mining
- Compactness:
  - reduces irrelevant information: infrequent items are gone
  - frequency-descending ordering: more frequent items are more likely to be shared
  - never larger than the original database (not counting node-links and counts)
  - example: for the Connect-4 DB, the compression ratio can be over 100
Step 2: Construct the Conditional Pattern Base
- Starting from the 2nd item of the header table in the FP-tree, traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of each frequent item to form an entry of the conditional pattern base (see the sketch below)

Conditional pattern base:
item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

[Same FP-tree and header table (f:4, c:4, a:3, b:3, m:3, p:3) as on the previous slide.]
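Continuing the sketch above, each item's conditional pattern base falls out of the header table's node lists and the parent pointers (`Node` and `header` come from the FP-tree code earlier):

```python
def conditional_pattern_base(item, header):
    """Prefix paths of every `item` node, each carrying that node's count."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:   # walk up to the root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base("m", header))
# [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]  ->  fca:2, fcab:1
```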
Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property: for a frequent item ai, all the frequent itemsets that contain ai can be obtained by following ai's node-links, starting from ai's entry in the header table of the FP-tree.
- Prefix-path property: to calculate the frequent itemsets containing ai in a path P, only the prefix sub-path of node ai in P needs to be accumulated, and its frequency count carries the same count as node ai.
Step 3: Construct the Conditional FP-tree
For each conditional pattern-base entry (illustrated here by m):
- Accumulate the count for each item in the entry
- Construct the conditional FP-tree over the frequent items of the entry

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree (b is dropped: its accumulated count of 1 is below the threshold of 3):
{} - f:3 - c:3 - a:3

[Full FP-tree and header table shown again for reference.]
Conditional Pattern-Bases and Conditional FP-trees
(assume minimal support = 3; "|" denotes concatenation)

Item | Conditional pattern-base      | Conditional FP-tree
f    | Empty                         | Empty
c    | {(f:3)}                       | {(f:3)}|c
a    | {(fc:3)}                      | {(f:3, c:3)}|a
b    | {(fca:1), (f:1), (c:1)}       | Empty
m    | {(fca:2), (fcab:1)}           | {(f:3, c:3, a:3)}|m
p    | {(fcam:2), (cb:1)}            | {(c:3)}|p
Step 4: Recursively Mine Conditional FP-trees
Suppose a conditional FP-tree T has a single path P. Then all frequent itemsets of T can be generated by enumerating all the item combinations in path P.

Single path of the m-conditional FP-tree: {} - f:3 - c:3 - a:3
All frequent itemsets containing m (all item combinations in P, each with m appended):
m, fm, cm, am, fcm, fam, cam, fcam
A recursive sketch follows.
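Putting the pieces together, a naive recursive FP-growth treats each conditional pattern base as a small weighted database and recurses (a sketch reusing `build_fp_tree` and `conditional_pattern_base` from above; it expands weighted paths by repetition for simplicity, which a real implementation would avoid):

```python
def fp_growth(transactions, min_count, suffix=()):
    """Return {pattern_tuple: support_count} for all frequent patterns."""
    _, header = build_fp_tree(transactions, min_count)
    patterns = {}
    for item in list(header):
        support = sum(node.count for node in header[item])
        pattern = (item,) + suffix
        patterns[pattern] = support
        # The conditional pattern base becomes the sub-problem's database.
        cond_db = []
        for path, count in conditional_pattern_base(item, header):
            cond_db.extend([path] * count)   # naive: repeat path `count` times
        patterns.update(fp_growth(cond_db, min_count, pattern))
    return patterns

# `db` is the 5-transaction database from the FP-tree sketch above.
print(len(fp_growth(db, min_count=3)))
# 18 patterns: f, c, a, b, m, p, fc, fa, ca, fca,
#              fm, cm, am, fcm, fam, cam, fcam, cp
```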
Why Is the FP-Growth Algorithm Fast?
Performance studies show that FP-growth is an order of magnitude faster than the Apriori algorithm. The reasons, rooted in FP-tree building:
- uses a compact data structure
- no candidate generation, no candidate tests
- eliminates repeated database scans
- the basic operation is counting
Presentation of Association Rules (Table Form) [figure not transcribed]
Visualization of Association Rules Using a Plane Graph [figure not transcribed]
Visualization of Association Rules Using a Rule Graph [figure not transcribed]
Chapter 5: Mining Association Rules in Large Databases
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
Multiple-Level Association Rules
- Items often form concept hierarchies. [Hierarchy figure: Food -> milk, bread; milk -> skim, 2%; bread -> white, wheat; with brands such as Fraser and Wonder at the lowest level.]
- Rules regarding itemsets at appropriate levels could be quite useful.
- Items at a lower level are expected to have lower support.
- The transaction database can be encoded based on dimensions and levels, e.g.:

TID | Items
T1  | {111, 121, 211, 221}
T2  | {111, 211, 222, 323}
T3  | {112, 122, 221, 411}
T4  | {111, 121}
T5  | {111, 122, 211, 221, 413}

- We can explore multi-level mining.
Mining Multi-Level Associations
A top-down, progressive deepening approach:
- First find high-level strong rules: milk -> bread [20%, 60%]
- Then find their lower-level "weaker" rules: 2% milk -> wheat bread [6%, 50%]
Variations of mining multiple-level association rules:
- Level-crossed association rules: 2% milk -> Wonder wheat bread
- Association rules with multiple, alternative concept hierarchies: 2% milk -> Wonder bread
Uniform Support vs. Reduced Support (Multi-level Association Mining)
Uniform Support: use the same minimum support for mining all levels.
- Pros: simple, one minimum support threshold; if an itemset contains any item whose ancestors do not have minimum support, there is no need to examine it.
- Cons: lower-level items do not occur as frequently, so if the support threshold is too high, we miss low-level associations; if too low, we generate too many high-level associations.
Uniform Support Example (multi-level mining with uniform support)
- Level 1 (min_sup = 5%): Milk [support = 10%] passes
- Level 2 (min_sup = 5%): 2% Milk [support = 6%] passes; Skim Milk [support = 4%] fails (✗)
Uniform Support vs. Reduced Support (Multi-level Association Mining)
Reduced Support: use a reduced minimum support at lower levels. There are 3 main search strategies:
- Level-by-level independent (no (i-1)-th level info used): each tree node (item) is examined, regardless of whether its parent node is frequent (too loose, not efficient)
- Level-cross filtering by k-itemset (at Lk, using (i-1)-th level k-itemset info): a k-itemset at the i-th level is examined iff its parent k-itemset at the (i-1)-th level is frequent (efficient, but too restrictive: many missing rules)
- Level-cross filtering by single item (at Lk, using (i-1)-th level 1-item info): an item at the i-th level is examined iff its parent node at the (i-1)-th level is significant (which may not mean frequent); a compromise between the above two strategies
Reduced Support: multi-level mining with reduced support (level-cross filtering by single item)
Item supports: Milk [10%]; 2% Milk [6%]; Skim Milk [4%]
- Case 1 (Level 1: min_sup = 8%; Level 2: min_sup = 4%): Milk is frequent, so both children are examined; 2% Milk and Skim Milk both pass.
- Case 2 (Level 1: min_sup = 12%; Level 2: min_sup = 6%): Milk (10%) is not frequent but is significant, so its children are still examined; 2% Milk passes, Skim Milk fails (✗).
In case 2, the level-1 parent node is significant but not frequent.
Redundancy Filtering in Multi-level Association Mining
Some rules may be redundant due to relationships between the "ancestors" of items. Example of a redundant rule:
- milk -> wheat bread [support = 8%, confidence = 70%]
- 2% milk -> wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to, or less than, the "expected" value based on the rule's ancestor.
Progressive Deepening in Multi-Level Association Mining
A top-down, progressive deepening approach:
- First mine high-level frequent items: milk (15%), bread (10%)
- Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
Different min_support threshold strategies across levels lead to different algorithms:
- If adopting the same min_support across all levels, toss item t if any of t's ancestors is infrequent.
- If adopting reduced min_support at lower levels, examine only those descendants whose ancestors are frequent or non-negligible.
Progressive Refinement of Data Mining Quality
Why progressive refinement? A mining operator can be expensive or cheap, fine or rough; trade speed for quality with step-by-step refinement.
Two- or multi-step mining:
- First apply a rough/cheap operator (superset coverage)
- Then apply an expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD'95)
Superset coverage property: preserve all the positive answers; allow false positives but not false negatives.
Chapter 5: Mining Association Rules in Large Databases
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
Multi-Dimensional Association (Concepts)
- Single-dimensional rules: buys(X, "milk") -> buys(X, "bread")
- Multi-dimensional rules:
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ^ occupation(X, "student") -> buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ^ buys(X, "popcorn") -> buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values (e.g., age)
Mining Quantitative Associations
For example, search for frequent k-predicate sets: {age, occupation, buys} is a 3-predicate set. Techniques can be categorized by how quantitative attributes (e.g., age) are treated:
1. Static discretization: quantitative attributes are statically discretized using predefined concept hierarchies, with numeric values replaced by ranges (e.g., for salary: 0-20K, 21K-30K, 31K-40K, >= 41K)
2. Dynamic discretization: quantitative attributes are dynamically discretized into bins based on the distribution of the data (e.g., equi-width / equi-depth bins)
3. Distance-based (clustering) discretization: a dynamic discretization process that also considers the distance between data points (see the distance-based example a few slides ahead)
Static Discretization of Quantitative Attributes
- Before mining, discretize data using a concept hierarchy: numeric values are replaced by ranges (e.g., for salary: 0-20K, 21K-30K, 31K-40K, >= 41K).
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets and store the support counts of the corresponding n-predicate sets.

[Cuboid lattice figure: () at the apex; (age), (income), (buys); then (age, income), (age, buys), (income, buys); and (age, income, buys) at the base.]
Quantitative Association Rule Mining
Numeric attributes are dynamically discretized such that the confidence or compactness of the mined rules is maximized. For 2D quantitative association rules of the form Aquan1 ^ Bquan2 -> Ccat, cluster "adjacent" association rules to form more general rules using a 2-D grid. Example:
- age(X, "34") ^ income(X, "30K-40K") -> buys(X, "TV")
- age(X, "34") ^ income(X, "40K-50K") -> buys(X, "TV")
- age(X, "35") ^ income(X, "30K-40K") -> buys(X, "TV")
- age(X, "35") ^ income(X, "40K-50K") -> buys(X, "TV")
These cluster into: age(X, "34-35") ^ income(X, "30K-50K") -> buys(X, "TV") (more compact)
ARCS (Association Rule Clustering System)
How does ARCS work?
1. Binning
2. Find frequent predicate set
3. Clustering
4. Optimize
Limitations of ARCS
- Only quantitative attributes on the LHS of rules
- Only 2 attributes on the LHS (2D limitation)
An alternative to ARCS: non-grid-based, equi-depth binning, with clustering based on a measure of partial completeness ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal).
Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval data. Distance-based partitioning gives a more meaningful discretization by considering the density ("number of points") of an interval and the "closeness" of points in an interval. Compare the partitions below (reproduced in the sketch that follows):

Price($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
7        | [0,10]: 1              | [7,20]: 2            | [7,7]: 1
20       | [11,20]: 1             | [22,50]: 2           | [20,22]: 2
22       | [21,30]: 1             | [51,53]: 2           | [50,53]: 3
50       | [31,40]: 0             |                      |
51       | [41,50]: 1             |                      |
53       | [51,60]: 2             |                      |
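The equi-width and equi-depth columns can be reproduced with a short numpy sketch (numpy is our choice of tool; the slides do not prescribe one):

```python
import numpy as np

prices = np.array([7, 20, 22, 50, 51, 53])

# Equi-width, width $10. numpy bins are half-open, so shift the edges by
# 0.5 to make the integer prices land in the slide's inclusive ranges
# [0,10], [11,20], ..., [51,60].
edges = np.array([-0.5, 10.5, 20.5, 30.5, 40.5, 50.5, 60.5])
print(np.histogram(prices, bins=edges)[0])   # [1 1 1 0 1 2]

# Equi-depth, depth 2: cut at the 1/3 and 2/3 quantiles so each of the
# three bins holds two of the six points: [7,20], [22,50], [51,53].
print(np.quantile(prices, [1/3, 2/3]))       # ~[21.33, 50.33]
```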
Chapter 5: Mining Association Rules in Large Databases
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
Interestingness Measurements
Objective measures: two popular measures are support and confidence.
Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).
Criticism of Support and Confidence (Example 1)
Example 1 (Aggarwal & Yu, PODS'98): 5000 students in total, analyzed in a 2D space (playing basketball, eating cereal):
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
The following association rules are found:
- play basketball -> eat cereal [40%, 66.7%]
- play basketball -> not eat cereal [20%, 33.3%]

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000

(Some cell counts are given; the rest are derived.)
Criticism of Support and Confidence (cont.)
Rule 1: play basketball -> eat cereal [40%, 66.7%]. Rule 1 suggests that a student who plays basketball likes eating cereal. This rule is misleading, because the overall percentage of students eating cereal is 75% (3750/5000), which is higher than 66.7%.
Rule 2: play basketball -> not eat cereal [20%, 33.3%]. Rule 2 is far more accurate: it suggests that a student who plays basketball tends not to eat cereal.
Interestingness Measure: Correlations (Lift)
- play basketball -> eat cereal [40%, 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%.
- play basketball -> not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
Measure of dependent/correlated events: lift.

lift(B, C) = P(B ∧ C) / (P(B) P(C))

With B = basketball and C = cereal:
lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000

Since lift(B, C) < lift(B, ¬C), a student playing basketball tends not to eat cereal. (Checked in the sketch below.)
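Both lift values are easy to verify in code (a sketch; the helper function is ours):

```python
def lift(p_ab, p_a, p_b):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); 1 means independence."""
    return p_ab / (p_a * p_b)

n = 5000
print(lift(2000/n, 3000/n, 3750/n))  # 0.888... < 1: negatively correlated
print(lift(1000/n, 3000/n, 1250/n))  # 1.333... > 1: positively correlated
```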
Criticism of Support and Confidence (Example 2)
X and Y are positively correlated; X and Z are negatively correlated (1 : occurred, 0 : not occurred):

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

Rule   | Support | Confidence
X => Y | 25%     | 50%
X => Z | 37.5%   | 75%

According to support and confidence, Z looks more positively correlated with X than Y does, contradicting the actual correlations. We need a measure of dependent or correlated events:

1) corr(A, B) = P(A ∧ B) / (P(A) P(B)) = P(B|A) / P(B) = P(A|B) / P(A)
2) lift(A, B) = conf(A -> B) / sup(B) = conf(B -> A) / sup(A)
Correlation (lift): taking both P(A) and P(B) into consideration.

corr(A, B) = lift(A, B) = P(A ∧ B) / (P(A) P(B)),    support(A -> B) = P(A ∧ B)

- P(A ∧ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if corr(A, B) < 1; otherwise A and B are positively correlated

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

P(X) = 1/2, P(Y) = 1/4, P(Z) = 7/8

Itemset | Support | corr (lift)
X, Y    | 1/4     | 2
X, Z    | 3/8     | 6/7
Y, Z    | 1/8     | 4/7

These values are verified in the sketch below.
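A short check of the corr/lift table (again a sketch with our own helper):

```python
rows = {"X": [1, 1, 1, 1, 0, 0, 0, 0],
        "Y": [1, 1, 0, 0, 0, 0, 0, 0],
        "Z": [0, 1, 1, 1, 1, 1, 1, 1]}

def corr(a, b):
    """corr(A, B) = P(A and B) / (P(A) * P(B)) over the 0/1 rows."""
    n = len(rows[a])
    p_a = sum(rows[a]) / n
    p_b = sum(rows[b]) / n
    p_ab = sum(x and y for x, y in zip(rows[a], rows[b])) / n
    return p_ab / (p_a * p_b)

print(corr("X", "Y"), corr("X", "Z"), corr("Y", "Z"))
# 2.0  0.857... (= 6/7)  0.571... (= 4/7)
```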
Constraint-Based Mining
Interactive, exploratory mining over gigabytes of data: could it be real? Yes, by making good use of constraints! Kinds of constraints that can be used in mining:
- Knowledge type constraints: specify the type of knowledge to be mined
- Data constraints: specify the relevant data, e.g., find product pairs sold together in Vancouver in Dec. '98
- Dimension (attribute) constraints: specify the desired dimensions, e.g., region, price, brand, customer category
- Rule constraints: specify the form of rules, e.g., small sales (price < $10) trigger big sales (sum > $200)
- Interestingness constraints: specify the thresholds, e.g., strong rules (min_support >= 3%, min_confidence >= 60%)
Rule Constraints in Association Mining
Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining, e.g., P(x, y) ^ Q(x, w) -> takes(x, "database systems")
- Rule content constraints: constraint-based query optimization (Ng, et al., SIGMOD'98), e.g., sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3
1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
- 1-var: a constraint confining only one side (LHS or RHS) of the rule, e.g., as shown above
- 2-var: a constraint confining both sides (LHS and RHS), e.g., sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
Chapter 5: Mining Association Rules in Large Databases
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
Why Is the Big Pie Still There?
- More on constraint-based mining of associations: Boolean vs. quantitative associations; association on discrete vs. continuous data
- From association to correlation and causal structure analysis: association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations, e.g., breaking the barriers of transactions (Lu, et al., TOIS'99)
- From association analysis to classification and clustering analysis, e.g., clustering association rules
Summary
- Association rule mining is probably the most significant contribution from the database community to KDD
- A large number of papers have been published, and many interesting issues have been explored
- An interesting research direction: association analysis on other types of data, such as spatial data, multimedia data, and time series data
Steps in the KDD Process (Technically)
Databases -> Relevant Data -> Data Preprocessing -> Data Mining -> Patterns -> Evaluation/Presentation -> Knowledge
Data mining is the core step of the KDD process.