Data Mining: Concepts and Techniques
January 17, 2001 Data Mining: Concepts and Techniques 1
Data Mining: Concepts and Techniques
— Slides for Textbook — Chapter 6 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples:
  - Rule form: "Body → Head [support, confidence]"
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]
Association Rule: Basic Concepts
n Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  - Home Electronics ⇒ * (What other products should the store stock up on?)
  - Attached mailing in direct marketing
  - Detecting "ping-pong"ing of patients, faulty "collisions"
Rule Measures: Support and Confidence
- Find all the rules X & Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X ∪ Y ∪ Z}
  - confidence, c: conditional probability that a transaction containing {X ∪ Y} also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)

(Venn diagram of customers buying beer, diapers, or both omitted.)
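The two measures can be checked directly against the four-transaction database above. A minimal sketch (TIDs and items taken from the slide):

```python
# Toy transaction database from the slide.
db = {2000: {"A", "B", "C"}, 1000: {"A", "C"},
      4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(lhs, rhs):
    """P(t contains rhs | t contains lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))       # 0.5
print(confidence({"A"}, {"C"}))  # 0.666...
print(confidence({"C"}, {"A"}))  # 1.0
```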
Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimension vs. multiple-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Correlation, causality analysis
    - Association does not necessarily imply correlation or causality
  - Maxpatterns and closed itemsets
  - Constraints enforced
    - E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Mining Association Rules—An Example
For rule A ⇒ C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Min. support 50%, min. confidence 50%:

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%
Mining Frequent Itemsets: the Key Step
n Find the frequent itemsets: the sets of items that have
minimum support
n A subset of a frequent itemset must also be a
frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
n Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
n Use the frequent itemsets to generate association rules.
The Apriori Algorithm
- Join Step: Ck is generated by joining Lk-1 with itself
- Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
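The pseudo-code above translates almost line for line into Python. A minimal sketch, using absolute support counts rather than percentages (`min_count=2` corresponds to 50% support on the four-transaction example of the next slide):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return all itemsets contained in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: unions of two (k-1)-itemsets that form a k-itemset.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One scan of the database to count candidate supports.
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_count}
        frequent |= Lk
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(D, 2))))
```

On this database the result is the nine frequent itemsets of the worked example: {1}, {2}, {3}, {5}, {1,3}, {2,3}, {2,5}, {3,5}, and {2,3,5}.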
The Apriori Algorithm — Example
Database D:

TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:

itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

→ L1:

itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:

itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

→ L2:

itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3: {2 3 5}; scan D → {2 3 5} has sup 2 → L3 = {{2 3 5}}
How to Generate Candidates?
n Suppose the items in Lk-1 are listed in an order
n Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
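The join-and-prune procedure can be sketched in Python over sorted tuples (the total order on items is taken to be Python's default sort order, which is an assumption of this sketch):

```python
from itertools import combinations

def apriori_gen(Lk_1, k):
    """Generate Ck from L(k-1): join on the first k-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in Lk_1)
    members = set(prev)
    Ck = set()
    for p in prev:
        for q in prev:
            # Join step: identical (k-2)-prefix, p's last item before q's.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must be in L(k-1).
                if all(s in members for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

# The example from the "Example of Generating Candidates" slide:
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))   # {('a', 'b', 'c', 'd')}
```

As on that slide, the join yields abcd and acde, and pruning removes acde because ade is not in L3.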
How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
n Method:
n Candidate itemsets are stored in a hash-tree
n Leaf node of hash-tree contains a list of itemsets and counts
n Interior node contains a hash table
n Subset function: finds all the candidates contained in a transaction
Example of Generating Candidates
n L3={abc, abd, acd, ace, bcd}
n Self-joining: L3*L3
n abcd from abc and abd
n acde from acd and ace
n Pruning:
n acde is removed because ade is not in L3
n C4={abcd}
Methods to Improve Apriori’s Efficiency
n Hash-based itemset counting: A k-itemset whose corresponding
hashing bucket count is below the threshold cannot be frequent
n Transaction reduction: A transaction that does not contain any
frequent k-itemset is useless in subsequent scans
n Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
n Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
n Dynamic itemset counting: add new candidate itemsets only when
all of their subsets are estimated to be frequent
Is Apriori Fast Enough? — Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets:
    - 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database:
    - Needs (n + 1) scans, where n is the length of the longest pattern
Mining Frequent Patterns Without Candidate Generation
n Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
n highly condensed, but complete for frequent pattern mining
n avoid costly database scans
n Develop an efficient, FP-tree-based frequent pattern mining method
n A divide-and-conquer methodology: decompose mining tasks into smaller ones
n Avoid candidate generation: sub-database test only!
Construct FP-tree from a Transaction DB
min_support = 0.5 (i.e., an item must appear in at least 3 of the 5 transactions)

TID | Items bought | (ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p

Header table (item : frequency, each with the head of its node-links):
f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

Resulting FP-tree:

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
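The three steps can be sketched in Python. This is an illustrative sketch, not the book's implementation; ties in frequency are broken alphabetically here, so the tree shape can differ cosmetically from the slide (which places f before c):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_count):
    """Scan 1: count items. Scan 2: insert each transaction's frequent
    items, in frequency-descending order, along a shared prefix path."""
    counts = defaultdict(int)
    for t in transactions:                       # scan 1
        for i in t:
            counts[i] += 1
    freq = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)                   # item -> node-links
    for t in transactions:                       # scan 2
        node = root
        for i in sorted((i for i in t if i in freq),
                        key=lambda i: (-freq[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)
print(sum(node.count for node in header["f"]))   # 4
```

Summing the counts along an item's node-links recovers its total support, which is exactly how the header table is used during mining.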
Benefits of the FP-tree Structure
- Completeness:
  - never breaks a long pattern of any transaction
  - preserves complete information for frequent pattern mining
- Compactness:
  - reduces irrelevant information: infrequent items are gone
  - frequency-descending ordering: more frequent items are more likely to be shared
  - never larger than the original database (not counting node-links and counts)
  - Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using FP-tree
- General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree
- Method:
  - For each item, construct its conditional pattern-base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct the conditional pattern base for each node in the FP-tree
2) Construct the conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
§ If the conditional FP-tree contains a single path, simply enumerate all the patterns
Step 1: From FP -tree to Conditional Pattern Base
- Starting at the frequent-item header table in the FP-tree
- Traverse the FP-tree by following the link of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases:

item | cond. pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

(The FP-tree and header table are the same as on the construction slide.)
Properties of FP-tree for Conditional Pattern Base Construction
n Node-link property
n For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following
ai's node-links, starting from ai's head in the FP-tree header
n Prefix path property
- To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
- For each pattern-base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

Example: the m-conditional pattern base is fca:2, fcab:1.

m-conditional FP-tree:

{}
└── f:3
    └── c:3
        └── a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern-Bases
Item | Conditional pattern-base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
Step 3: Recursively mine the conditional FP-tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3

- Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
- Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
Single FP-tree Path Generation
n Suppose an FP -tree T has a single path P
n The complete set of frequent pattern of T can be generated by enumeration of all the combinations of the
sub-paths of P
Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are enumerated directly: m, fm, cm, am, fcm, fam, cam, fcam
Principles of Frequent Pattern Growth
n Pattern growth property
n Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B.
Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
n “abcdef ” is a frequent pattern, if and only if
n “abcde ” is a frequent pattern, and
n “f ” is frequent in the set of transactions containing
“abcde ”
Why Is Frequent Pattern Growth Fast?
n Our performance study shows
n FP-growth is an order of magnitude faster than
Apriori, and is also faster than tree-projection
n Reasoning
n No candidate generation, no candidate test
n Use compact data structure
n Eliminate repeated database scan
n Basic operation is counting and FP-tree building
FP-growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time in seconds (0–100) vs. support threshold in % (0–3) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime.)
FP-growth vs. Tree-Projection: Scalability with Support Threshold
(Figure: run time in seconds (0–140) vs. support threshold in % (0–2) on data set T25I20D100K, comparing D2 FP-growth against D2 TreeProjection.)
Presentation of Association Rules (Table Form)
Visualization of Association Rule Using Plane Graph
Visualization of Association Rule Using Rule Graph
Iceberg Queries
- Iceberg query: compute aggregates over one attribute (or a set of attributes) only for those groups whose aggregate value is above a certain threshold
- Example:
    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) >= 10
- Compute iceberg queries efficiently with an Apriori-style strategy:
  - First compute lower dimensions
  - Then compute higher dimensions only when all the lower ones are above the threshold
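The Apriori-style strategy can be sketched in Python; the `purchases` rows and the threshold below are made-up illustrative data, not from the slides:

```python
from collections import Counter

# Hypothetical rows: (custID, itemID, qty); threshold on sum(qty).
purchases = [("c1", "i1", 6), ("c1", "i1", 5), ("c1", "i2", 1),
             ("c2", "i1", 4), ("c2", "i2", 12)]
T = 10

# A (cust, item) group can reach sum(qty) >= T only if both of its
# 1-dimensional projections (on custID alone, on itemID alone) reach T.
by_cust, by_item = Counter(), Counter()
for c, i, q in purchases:
    by_cust[c] += q
    by_item[i] += q

pairs = Counter()
for c, i, q in purchases:
    if by_cust[c] >= T and by_item[i] >= T:   # prune before pair counting
        pairs[(c, i)] += q

iceberg = {k: v for k, v in pairs.items() if v >= T}
print(iceberg)   # {('c1', 'i1'): 11, ('c2', 'i2'): 12}
```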
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Multiple-Level Association Rules
- Items often form a hierarchy, e.g.:

Food
├── milk
│   ├── 2%
│   └── skim
└── bread
    ├── white
    └── wheat

(with brand-level leaves such as Sunset and Fraser under the milk subtypes)

- Items at the lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels could be quite useful.
- The transaction database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.

TID | Items
T1 | {111, 121, 211, 221}
T2 | {111, 211, 222, 323}
T3 | {112, 122, 221, 411}
T4 | {111, 121}
T5 | {111, 122, 211, 221, 413}
Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
- Variations on mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels
  - + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - – Lower-level items do not occur as frequently, so if the support threshold is
    - too high ⇒ we miss low-level associations
    - too low ⇒ we generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
  - Four search strategies:
    - Level-by-level independent
    - Level-cross filtering by k-itemset
    - Level-cross filtering by single item
    - Controlled level-cross filtering by single item
Uniform Support
Multi-level mining with uniform support:

Level 1 (min_sup = 5%): Milk [support = 10%] (frequent)
Level 2 (min_sup = 5%): 2% Milk [support = 6%] (frequent); Skim Milk [support = 4%] (pruned)
Reduced Support
Multi-level mining with reduced support:

Level 1 (min_sup = 5%): Milk [support = 10%] (frequent)
Level 2 (min_sup = 3%): 2% Milk [support = 6%] (frequent); Skim Milk [support = 4%] (frequent)
Multi-level Association: Redundancy Filtering
n Some rules may be redundant due to “ancestor” relationships between items.
n Examplen milk ⇒ wheat bread [support = 8%, confidence = 70%]
n 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
n We say the first rule is an ancestor of the second rule.
n A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor.
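The redundancy test can be sketched numerically. The tolerance and the child item's share of the ancestor item are illustrative assumptions of this sketch, not values given in the slides:

```python
def is_redundant(child_sup, ancestor_sup, child_frac, tol=0.1):
    """A descendant rule is redundant when its support is close to the
    value "expected" from its ancestor rule.
    child_frac: assumed share of the ancestor item's occurrences held by
    the child item (e.g. 2% milk as 1/4 of all milk purchases).
    tol: assumed relative tolerance for "close"."""
    expected = ancestor_sup * child_frac
    return abs(child_sup - expected) <= tol * expected

# milk => wheat bread has 8% support; if 2% milk is 1/4 of milk purchases,
# the expected support of "2% milk => wheat bread" is 8% * 1/4 = 2%,
# matching the observed 2%: the rule adds no new information.
print(is_redundant(child_sup=0.02, ancestor_sup=0.08, child_frac=0.25))
```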
Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across levels lead to different algorithms:
  - If adopting the same min_support across all levels, then toss t if any of t's ancestors is infrequent.
  - If adopting reduced min_support at lower levels, then examine only those descendants whose ancestors' support is frequent/non-negligible.
Progressive Refinement of Data Mining Quality
n Why progressive refinement?
- The mining operator can be expensive or cheap, fine or rough
- Trade speed for quality: step-by-step refinement
- Superset coverage property:
  - Preserve all the positive answers: allow false positives, but no false negatives
n Two- or multi-step mining:
n First apply rough/cheap operator (superset coverage)
n Then apply expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD’95).
Progressive Refinement Mining of Spatial Association Rules
- Hierarchy of spatial relationships:
  - "g_close_to": near_by, touch, intersect, contain, etc.
  - First search for a rough relationship and then refine it
- Two-step mining of spatial associations:
  - Step 1: rough spatial computation (as a filter), using MBRs or an R-tree for rough estimation
  - Step 2: detailed spatial algorithm (as refinement), applied only to those objects that passed the rough spatial association test (no less than min_support)
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Multi-Dimensional Association: Concepts
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values
Techniques for Mining MD Associations
- Search for frequent k-predicate sets:
  - Example: {age, occupation, buys} is a 3-predicate set
- Techniques can be categorized by how quantitative attributes, such as age, are treated:
  1. Using static discretization of quantitative attributes
     - Quantitative attributes are statically discretized using predefined concept hierarchies
  2. Quantitative association rules
     - Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data
  3. Distance-based association rules
     - A dynamic discretization process that considers the distance between data points
Static Discretization of Quantitative Attributes
- Discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.

Lattice of cuboids:

()
(age)  (income)  (buys)
(age, income)  (age, buys)  (income, buys)
(age, income, buys)
Quantitative Association Rules
- Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster "adjacent" association rules to form general rules using a 2-D grid
- Example: age(X, "30-34") ∧ income(X, "24K-48K") ⇒ buys(X, "high resolution TV")
ARCS (Association Rule Clustering System)
How does ARCS work?
1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize
Limitations of ARCS
n Only quantitative attributes on LHS of rules.
n Only 2 attributes on LHS. (2D limitation)
n An alternative to ARCS
n Non-grid-based
n equi-depth binning
n clustering based on a measure of partial
completeness.
n “Mining Quantitative Association Rules in Large Relational Tables” by R. Srikant and R. Agrawal.
Mining Distance-based Association Rules
n Binning methods do not capture the semantics of interval data
- Distance-based partitioning gives a more meaningful discretization, considering:
  - density/number of points in an interval
  - "closeness" of points in an interval

Price ($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
7  | [0, 10]  | [7, 20]  | [7, 7]
20 | [11, 20] | [22, 50] | [20, 22]
22 | [21, 30] | [51, 53] | [50, 53]
50 | [31, 40] |          |
51 | [41, 50] |          |
53 | [51, 60] |          |
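The first two binning schemes are easy to sketch in Python (the exact bin boundary convention for equi-width is an assumption here; this sketch uses half-open $10 ranges starting at 0, so the labels differ slightly from the table):

```python
prices = [7, 20, 22, 50, 51, 53]

def equi_width(vals, width=10):
    """Fixed-width ranges: bin index = value // width."""
    bins = {}
    for v in vals:
        lo = (v // width) * width
        bins.setdefault((lo, lo + width - 1), []).append(v)
    return bins

def equi_depth(vals, depth=2):
    """Consecutive groups of `depth` values of the sorted list."""
    s = sorted(vals)
    return [s[i:i + depth] for i in range(0, len(s), depth)]

print(equi_width(prices))  # {(0, 9): [7], (20, 29): [20, 22], (50, 59): [50, 51, 53]}
print(equi_depth(prices))  # [[7, 20], [22, 50], [51, 53]]
```

Note how equi-depth reproduces the table's [7,20], [22,50], [51,53] intervals, while neither scheme notices that 20 and 22 are far closer to each other than to 7 or 50: that is exactly the gap distance-based partitioning addresses.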
Clusters and Distance Measurements

- S[X] is a set of N tuples t1, t2, …, tN projected on the attribute set X
- The diameter of S[X]:

  d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{dist}_X(t_i[X], t_j[X])}{N(N-1)}

- dist_X: a distance metric, e.g., Euclidean or Manhattan distance
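For a single numeric attribute the diameter is just the average pairwise distance. A minimal sketch using Manhattan distance in one dimension:

```python
def diameter(points):
    """d(S[X]) = sum_i sum_j dist(t_i, t_j) / (N (N - 1)),
    with 1-D Manhattan distance |a - b|."""
    n = len(points)
    total = sum(abs(a - b) for a in points for b in points)
    return total / (n * (n - 1))

print(diameter([7, 7]))        # 0.0  -> a dense interval like [7, 7]
print(diameter([7, 20, 22]))   # 10.0 -> a much looser grouping
```

Small diameters flag dense, semantically meaningful intervals, which is what the distance-based column of the binning table above is capturing.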
Clusters and Distance Measurements (Cont.)

- The diameter d assesses the density of a cluster C_X, where
    d(C_X) ≤ d_0^X (density threshold) and |C_X| ≥ s_0 (frequency threshold)
- Finding clusters and distance-based rules:
  - the density threshold d_0 replaces the notion of support
  - uses a modified version of the BIRCH clustering algorithm
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Interestingness Measurements
- Objective measures: two popular measures are (1) support and (2) confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if (1) it is unexpected (surprising to the user) and/or (2) actionable (the user can do something with it)
Criticism to Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98): among 5000 students
  - 3000 play basketball
  - 3750 eat cereal
  - 2000 both play basketball and eat cereal
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000
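The arithmetic behind this example is worth spelling out; a lift below 1 exposes the negative correlation that support and confidence hide:

```python
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support = both / n                # 0.40
confidence = both / basketball    # 0.666...
p_cereal = cereal / n             # 0.75
lift = confidence / p_cereal      # 0.888... < 1: playing basketball
                                  # actually lowers P(eat cereal)
print(round(support, 3), round(confidence, 3), round(lift, 3))
```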
Criticism to Support and Confidence (Cont.)
- Example 2:
  - X and Y: positively correlated; X and Z: negatively related
  - yet the support and confidence of X ⇒ Z dominate
- We need a measure of dependent or correlated events:

  corr_{A,B} = P(A ∪ B) / (P(A) P(B))

- P(B|A)/P(B) is also called the lift of the rule A ⇒ B

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

Rule | Support | Confidence
X ⇒ Y | 25% | 50%
X ⇒ Z | 37.5% | 75%
Other Interestingness Measures: Interest
- Interest (correlation, lift):

  interest(A, B) = P(A ∧ B) / (P(A) P(B))

- takes both P(A) and P(B) into consideration
- P(A ∧ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

Itemset | Support | Interest
X, Y | 25% | 2
X, Z | 37.5% | 0.9
Y, Z | 12.5% | 0.57
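The interest values in the table can be recomputed from the 0/1 rows directly (the X,Z value is 6/7 ≈ 0.86, which the slide rounds to 0.9):

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def interest(a, b):
    """P(A ^ B) / (P(A) P(B)); 1 = independent, <1 negative, >1 positive."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x and y for x, y in zip(a, b)) / n
    return pab / (pa * pb)

print(round(interest(X, Y), 2))  # 2.0
print(round(interest(X, Z), 2))  # 0.86
print(round(interest(Y, Z), 2))  # 0.57
```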
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data: can it be real? Yes, by making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge-type constraint: classification, association, etc.
  - Data constraint: SQL-like queries
    - Find product pairs sold together in Vancouver in Dec. '98
  - Dimension/level constraints: relevance to region, price, brand, customer category
  - Rule constraints: small sales (price < $10) trigger big sales (sum > $200)
  - Interestingness constraints: strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
Rule Constraints in Association Mining
- Two kinds of rule constraints:
  - Rule-form constraints: meta-rule guided mining
    - P(x, y) ^ Q(x, w) → takes(x, "database systems")
  - Rule (content) constraints: constraint-based query optimization (Ng et al., SIGMOD'98)
    - sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99):
  - 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above
  - 2-var: a constraint confining both sides (L and R)
    - sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
Constraint-Based Association Query

- Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
- A constrained association query (CAQ) is of the form {(S1, S2) | C},
  - where C is a set of constraints on S1, S2, including a frequency constraint
- A classification of (single-variable) constraints:
  - Class constraint: S ⊂ A, e.g., S ⊂ Item
  - Domain constraint:
    - S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
    - v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
    - V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
  - Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., count(S1.Type) = 1, avg(S2.Price) < 100
Constrained Association Query Optimization Problem
- Given a CAQ = {(S1, S2) | C}, the algorithm should be:
  - sound: it only finds frequent sets that satisfy the given constraints C
  - complete: all frequent sets that satisfy the given constraints C are found
- A naïve solution: apply Apriori to find all frequent sets, then test them for constraint satisfaction one by one
- Our approach: comprehensively analyze the properties of constraints and push them as deeply as possible inside the frequent-set computation
Anti-monotone and Monotone Constraints
- A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
- A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies it
Succinct Constraint
- A subset of items Is is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator
- SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
- A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
Convertible Constraint
- Suppose all items in patterns are listed in a total order R
- A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
- A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C
Relationships Among Categories of Constraints
(Figure: Venn diagram relating the categories: succinctness, anti-monotonicity, monotonicity, convertible constraints, and inconvertible constraints.)
Property of Constraints: Anti-Monotone
n Anti-monotonicity: if a set S violates the constraint, any superset of S also violates the constraint.
n Examples:
n sum(S.Price) ≤ v is anti-monotone
n sum(S.Price) ≥ v is not anti-monotone
n sum(S.Price) = v is partly anti-monotone
n Application:
n Push "sum(S.Price) ≤ 1000" deeply into the iterative frequent set computation.
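Pushing the anti-monotone price constraint into the Apriori level-wise loop can be sketched as follows. This is a simplified illustration, not the book's algorithm: the transactions, prices, and thresholds are invented, and candidate generation is done by a plain pairwise join.

```python
prices = {'a': 700, 'b': 400, 'c': 200, 'd': 100}  # hypothetical price table
transactions = [{'a', 'c'}, {'b', 'c', 'd'}, {'a', 'c', 'd'}, {'c', 'd'}]
min_sup, max_price = 2, 1000

def frequent_constrained(transactions, min_sup):
    # Level 1: keep single items that are frequent AND satisfy the constraint.
    level = [frozenset([i]) for i in prices
             if sum(i in t for t in transactions) >= min_sup
             and prices[i] <= max_price]
    result, k = list(level), 2
    while level:
        # Join step, then prune by BOTH anti-monotone conditions:
        # a candidate whose price sum already exceeds max_price can never
        # qualify later, so it is dropped before any support counting.
        cands = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in cands
                 if sum(prices[i] for i in c) <= max_price
                 and sum(c <= t for t in transactions) >= min_sup]
        result += level
        k += 1
    return result

print(sorted(tuple(sorted(s)) for s in frequent_constrained(transactions, min_sup)))
# [('a',), ('a', 'c'), ('c',), ('c', 'd'), ('d',)]
```

Note how {a, d} is rejected by support and {a, c, d} never survives, while the price filter runs before counting, which is exactly the benefit of anti-monotone pushing.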
Characterization of Anti-Monotonicity Constraints
Constraint                      Anti-monotone?
S θ v, θ ∈ {=, ≤, ≥}            yes
v ∈ S                           no
S ⊇ V                           no
S ⊆ V                           yes
S = V                           partly
min(S) ≤ v                      no
min(S) ≥ v                      yes
min(S) = v                      partly
max(S) ≤ v                      yes
max(S) ≥ v                      no
max(S) = v                      partly
count(S) ≤ v                    yes
count(S) ≥ v                    no
count(S) = v                    partly
sum(S) ≤ v                      yes
sum(S) ≥ v                      no
sum(S) = v                      partly
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
(frequent constraint)           (yes)
Example of Convertible Constraints: avg(S) θ v
n Let R be the value-descending order over the set of items
n E.g., I = {9, 8, 6, 4, 3, 1}
n avg(S) ≥ v is convertible monotone w.r.t. R
n If S is a suffix of S1, then avg(S1) ≥ avg(S)
n {8, 4, 3} is a suffix of {9, 8, 4, 3}
n avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
n If S satisfies avg(S) ≥ v, so does S1
n {8, 4, 3} satisfies avg(S) ≥ 4, and so does {9, 8, 4, 3}
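The suffix argument above can be verified mechanically. A minimal sketch using the slide's own item values; the helper name `avg` is ours:

```python
I = [9, 8, 6, 4, 3, 1]      # already listed in value-descending order R

def avg(s):
    return sum(s) / len(s)

suffix = [8, 4, 3]           # satisfies avg(S) >= 4 (its average is 5)
extended = [9, 8, 4, 3]      # a pattern of which `suffix` is a suffix w.r.t. R

# Convertible monotone: extending a pattern at the front (i.e., with items
# that are larger under R) cannot lower the average, so satisfaction of
# avg(S) >= v is preserved.
print(avg(suffix), avg(extended))   # 5.0 6.0
assert avg(extended) >= avg(suffix) >= 4
```

Symmetrically, avg(S) ≤ v is convertible anti-monotone w.r.t. the same order R.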
Property of Constraints: Succinctness
n Succinctness:
n For any sets S1 and S2 satisfying C, S1 ∪ S2 also satisfies C
n Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
n Examples:
n sum(S.Price) ≥ v is not succinct
n min(S.Price) ≤ v is succinct
n Optimization:
n If C is succinct, then C is pre-counting prunable: whether an itemset satisfies the constraint can be decided up front and is not affected by the iterative support counting.
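Succinctness means the satisfying itemsets can be enumerated directly, before any support counting. A sketch for min(S.Price) ≤ v under hypothetical prices: every satisfying set must contain at least one item from σprice≤v(I), so candidates are generated as a nonempty subset of the cheap items unioned with any subset of the rest.

```python
from itertools import combinations

prices = {'a': 10, 'b': 20, 'c': 50, 'd': 80}  # hypothetical price table
v = 25

cheap = [i for i in prices if prices[i] <= v]   # sigma_{price <= v}(I) = {a, b}
rest  = [i for i in prices if prices[i] > v]    # remaining items     = {c, d}

def satisfying_sets():
    # Sets satisfying min(S.price) <= v are exactly:
    #   (nonempty subset of cheap) union (any subset of rest)
    for r in range(1, len(cheap) + 1):
        for base in combinations(cheap, r):
            for r2 in range(len(rest) + 1):
                for extra in combinations(rest, r2):
                    yield frozenset(base + extra)

sets = set(satisfying_sets())
assert all(min(prices[i] for i in s) <= v for s in sets)
print(len(sets))  # 12 = (2^2 - 1) nonempty cheap subsets x 2^2 rest subsets
```

By contrast, no such direct generator exists for sum(S.Price) ≥ v, which is why that constraint is not succinct.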
Characterization of Constraints by Succinctness
Constraint                      Succinct?
S θ v, θ ∈ {=, ≤, ≥}            yes
v ∈ S                           yes
S ⊇ V                           yes
S ⊆ V                           yes
S = V                           yes
min(S) ≤ v                      yes
min(S) ≥ v                      yes
min(S) = v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      yes
max(S) = v                      yes
count(S) ≤ v                    weakly
count(S) ≥ v                    weakly
count(S) = v                    weakly
sum(S) ≤ v                      no
sum(S) ≥ v                      no
sum(S) = v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}       no
(frequent constraint)           (no)
Chapter 6: Mining Association Rules in Large Databases
n Association rule mining
n Mining single-dimensional Boolean association rules from transactional databases
n Mining multilevel association rules from transactional databases
n Mining multidimensional association rules from transactional databases and data warehouse
n From association mining to correlation analysis
n Constraint-based association mining
n Summary
Why Is the Big Pie Still There?
n More on constraint-based mining of associations
n Boolean vs. quantitative associations
n Association on discrete vs. continuous data
n From association to correlation and causal structure analysis
n Association does not necessarily imply correlation or causal relationships
n From intra-transaction association to inter-transaction associations
n E.g., break the barriers of transactions (Lu et al., TOIS'99)
n From association analysis to classification and clustering analysis
n E.g., clustering association rules
Summary
n Association rule mining
n probably the most significant contribution from the database community in KDD
n A large number of papers have been published
n Many interesting issues have been explored
n An interesting research direction
n Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References
n R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
n R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
n R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
n R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
n R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
n S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
n S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
n K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
n D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
n M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References (2)
n G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
n Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
n T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
n E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
n J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
n J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
n J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
n T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
n M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
n M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
References (3)
n F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
n B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
n H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
n H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
n H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
n R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
n R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
n R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
n N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
References (4)
n J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
n J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00, 11-20, Dallas, TX, May 2000.
n J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00. Boston, MA. Aug. 2000.
n G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
n B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
n S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
n S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
n A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
n A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
References (5)
n C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
n R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
n R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
n R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
n H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
n D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
n K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
n M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
n M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug. 2000.
n O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.
http://www.cs.sfu.ca/~han/dmbook
Thank you!!!