Data Mining-Association Mining_3.pdf
ASSOCIATION RULE MINING
Mining Frequent Itemsets Using the Vertical Data Format
- Horizontal format: each row is a transaction (TID, set of items). Vertical format: each row is an item (or itemset) with its TID_set, the set of transactions containing it.
- Transform the data from horizontal to vertical format; the support count of an itemset is the length of its TID_set.
- Starting with k = 1, frequent k-itemsets are used to construct candidate (k+1)-itemsets by intersecting the TID_sets of every pair of frequent k-itemsets; the Apriori property is exploited for pruning.
- No database scan is needed to find the support of the (k+1)-itemsets.
- To avoid long TID_sets, keep track of only the differences (diffsets) between the TID_set of a (k+1)-itemset and that of the corresponding k-itemset. Ex: diffset({I1, I2}, {I1}) = {T500, T700}

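The procedure above can be sketched in a few lines. The toy database below is hypothetical, chosen so that it reproduces the slide's diffset example:

```python
from itertools import combinations

# Hypothetical toy database in horizontal format: TID -> items.
horizontal = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},       "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},       "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},       "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
min_sup = 2

# Transform to vertical format: itemset -> TID_set.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(frozenset([item]), set()).add(tid)

# Frequent 1-itemsets: support count is the length of the TID_set.
freq = {iset: tids for iset, tids in vertical.items() if len(tids) >= min_sup}

# Candidate 2-itemsets by intersecting TID_sets -- no database rescan.
freq2 = {}
for a, b in combinations(sorted(freq, key=sorted), 2):
    tids = freq[a] & freq[b]
    if len(tids) >= min_sup:
        freq2[a | b] = tids

# Diffset: TIDs in the k-itemset's TID_set but not the (k+1)-itemset's.
def diffset(k1_tids, k_tids):
    return k_tids - k1_tids

i1, i1i2 = frozenset(["I1"]), frozenset(["I1", "I2"])
print(sorted(diffset(freq2[i1i2], freq[i1])))  # ['T500', 'T700']
```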
Maximal and Closed Frequent Itemsets
Mining closed frequent itemsets:
- Simple approach: mine the complete set of frequent itemsets, then eliminate every frequent itemset that is a proper subset of some other frequent itemset with the same support.
- Better: directly mine closed frequent itemsets with effective pruning strategies.

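The simple approach can be sketched directly on a tiny hypothetical database (brute-force enumeration, fine at toy scale):

```python
from itertools import combinations

# Simple approach: mine the complete set of frequent itemsets, then
# eliminate every itemset that has a proper superset with equal support.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C"},
                {"B", "C"}, {"C"}]
min_sup = 2

# Enumerate all itemsets and their support counts.
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        sup = sum(1 for t in transactions if set(combo) <= t)
        if sup >= min_sup:
            frequent[frozenset(combo)] = sup

# Keep only closed itemsets: no proper superset with the same support.
closed = {
    iset: sup for iset, sup in frequent.items()
    if not any(iset < other and sup == frequent[other] for other in frequent)
}
# {A} is eliminated: sup({A}) = sup({A,B}) = 3, so it is not closed.
print(sorted("".join(sorted(s)) for s in closed))  # ['AB', 'ABC', 'B', 'BC', 'C']
```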
Mining Closed Frequent Itemsets: Item Merging
- If every transaction containing a frequent itemset X also contains an itemset Y, but no proper superset of Y, then X U Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y.
- Ex: the projected database for {I5 : 2} is {{I2, I1}, {I2, I1, I3}}. Every transaction in it contains the itemset {I2, I1} but no proper superset of {I2, I1}, so {I2, I1} can be merged with {I5} to give {I5, I2, I1 : 2}.
- There is no need to mine for closed itemsets that contain I5 but not {I2, I1}.

Mining Closed Frequent Itemsets: Sub-Itemset Pruning
- If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set-enumeration tree cannot be frequent closed itemsets and can thus be pruned.
- Ex: two transactions <a1, a2, ..., a100> and <a1, a2, ..., a50>, with min_sup = 2. The projection on a1 gives {a1, a2, ..., a50 : 2} based on itemset merging.
- support({a2}) = support({a1, a2, ..., a50}) = 2 and {a2} is a proper subset, so there is no need to examine a2 and its projections.

Mining Closed Frequent Itemsets: Item Skipping
- In depth-first mining of closed itemsets, each prefix itemset X is associated with a header table and a projected database.
- If a local frequent item p has the same support in several header tables at different levels, p can safely be pruned from the header tables at the higher levels.
- Ex: a2 has the same support in the global header table as in a1's projected database, so a2 can be pruned from the global header table.

Mining Closed Frequent Itemsets: Closure Checking
- Check whether a newly found itemset is a superset or subset of an already found closed frequent itemset with the same support.
- Superset checking: handled by item merging.
- Subset checking: maintain a pattern tree of the closed itemsets mined so far (similar to an FP-tree). A candidate Sc is subsumed by an already found closed itemset Sa if: both have the same support, the length of Sc is smaller than that of Sa, and all items in Sc are contained in Sa.

Multilevel Association Rules
- Rules generated from association rule mining with concept hierarchies.
- Rules at high levels of the hierarchy may be too general (common-sense knowledge), yet some rules can only be discovered at the higher levels.
- Uniform support: the same minimum support threshold is used at all levels.
- Reduced support: the minimum support threshold is lowered at lower levels.

Using Reduced Support
- Level-by-level independent: full-breadth search; no background knowledge is used for pruning.
- Level-cross filtering by single item: an item at the ith level is examined iff its parent node at the (i-1)st level is frequent.
- Level-cross filtering by k-itemset: a k-itemset at the ith level is examined iff its corresponding parent k-itemset at the (i-1)st level is frequent.
- Group-based support: item groups may be given their own minimum support thresholds.

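Level-cross filtering by single item can be sketched as follows; the hierarchy, counts, and thresholds are hypothetical:

```python
# Level-cross filtering by single item under reduced support: an item at
# level i is examined only if its parent at level i-1 is frequent.
parent = {            # child item -> parent item in the concept hierarchy
    "2% milk": "milk",
    "skim milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}
count = {"milk": 120, "bread": 40, "2% milk": 60,
         "skim milk": 35, "wheat bread": 25, "white bread": 10}
min_sup = {1: 50, 2: 30}  # reduced support: lower threshold at level 2

# Level 1: ordinary support check.
frequent_l1 = {i for i in ("milk", "bread") if count[i] >= min_sup[1]}

# Level 2: examine an item iff its parent is frequent at level 1.
# "bread" is infrequent, so its children are never even counted.
examined = {i for i in parent if parent[i] in frequent_l1}
frequent_l2 = {i for i in examined if count[i] >= min_sup[2]}
print(sorted(frequent_l2))  # ['2% milk', 'skim milk']
```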
Redundant Multilevel Association Rules: Filtering
- Some rules may be redundant due to ancestor relationships between items.
- Ex: milk => wheat bread [support 8%, confidence 70%] and 2% milk => wheat bread [support 2%, confidence 72%]. The first rule is an ancestor of the second.
- A rule is redundant if its support and confidence are close to their expected values, based on the rule's ancestor.

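The redundancy test can be sketched numerically. Assuming (hypothetically) that 2% milk accounts for a quarter of all milk purchases, the expected support of the lower-level rule is 8% x 1/4 = 2%, which matches its observed support, so the rule carries no new information:

```python
# Flag a lower-level rule as redundant when its support and confidence
# are close to the values expected from its ancestor rule.
def is_redundant(ancestor_sup, ancestor_conf, child_sup, child_conf,
                 child_share, tol=0.1):
    """child_share: fraction of the ancestor item's occurrences that
    belong to the child item (e.g. 2% milk as a share of all milk).
    tol: relative tolerance for 'close to expected' (an assumption)."""
    expected_sup = ancestor_sup * child_share
    expected_conf = ancestor_conf  # confidence is expected to carry over
    close = lambda got, exp: exp > 0 and abs(got - exp) / exp <= tol
    return close(child_sup, expected_sup) and close(child_conf, expected_conf)

# milk => wheat bread [8%, 70%]; 2% milk => wheat bread [2%, 72%]
print(is_redundant(0.08, 0.70, 0.02, 0.72, child_share=0.25))  # True
```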
Multidimensional Association Rules
- Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
- Multidimensional rules (2 or more dimensions/predicates):
  - Inter-dimension assoc. rules (no repeated predicates): age(X, "19-25") ^ occupation(X, "student") => buys(X, "coke")
  - Hybrid-dimension assoc. rules (repeated predicates): age(X, "19-25") ^ buys(X, "popcorn") => buys(X, "coke")

Categorical and Quantitative Attributes
- Categorical attributes: finite number of possible values, no ordering among values.
- Quantitative attributes: numeric, implicit ordering among values.

Mining Quantitative Associations
- Static discretization based on predefined concept hierarchies.
- Dynamic discretization based on data distribution.
- Clustering: distance-based association.

Static Discretization of Quantitative Attributes
- Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for (faster) mining: fully materialized cubes may already exist, and the Apriori property prunes the search.

Quantitative Association Rules
- Numeric attributes are dynamically discretized so that the confidence of the rules mined is maximized.
- 2-D quantitative rules: Aquan1 ^ Aquan2 => Acat
- ARCS (Association Rules Clustering System) clusters adjacent association rules to form general rules using a 2-D grid:
  - Binning: equiwidth, equidepth, or homogeneity-based; use a 2-D array.
  - Finding frequent predicate sets.
  - Clustering the association rules: grid-based technique, rectangular regions.

Clustering Association Rules: Example
- age(X, 34) ^ income(X, "30-40K") => buys(X, "high resolution TV")
- age(X, 35) ^ income(X, "30-40K") => buys(X, "high resolution TV")
- age(X, 34) ^ income(X, "40-50K") => buys(X, "high resolution TV")
- age(X, 35) ^ income(X, "40-50K") => buys(X, "high resolution TV")
- These adjacent rules cluster into the general rule: age(X, "34-35") ^ income(X, "30-50K") => buys(X, "high resolution TV")

Strong vs. Interesting
- The user is the final judge of interestingness.
- Misleading rule: buys(X, "computer games") => buys(X, "videos") [support 40%, confidence 66.7%]
- May have to use other measures.

Correlation Analysis: Lift
- The correlation between itemsets A and B is calculated.
- A and B are independent if P(A U B) = P(A)P(B); otherwise A and B are correlated. (Here P(A U B) denotes the probability that a transaction contains both A and B.)
- lift(A, B) = P(A U B) / (P(A)P(B))
- Value < 1: A is negatively correlated with B; value > 1: A and B are positively correlated; value = 1: independent.
- Equivalently, lift(A => B) = conf(A => B) / sup(B).

Correlation Analysis: Example
- lift = P(game and video) / (P(game) x P(video)) = 0.89 < 1, so buying computer games and buying videos are negatively correlated.
- buys(X, "computer games") => not buys(X, "videos") is more accurate, although it has lower support and confidence.

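The lift figure can be reproduced from contingency counts. The counts below are an assumption consistent with the slide's 0.89 value (10,000 transactions; 6,000 buy games; 7,500 buy videos; 4,000 buy both):

```python
# Lift for the computer-games / videos example.
n_total, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

p_game = n_game / n_total    # P(game)  = 0.60
p_video = n_video / n_total  # P(video) = 0.75
p_both = n_both / n_total    # P(game and video) = 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))  # 0.89 -> less than 1: negative correlation
```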
Correlation Analysis: Chi-Square
- chi^2 = sum of (observed - expected)^2 / expected over the contingency table
  = (4000-4500)^2/4500 + (3500-3000)^2/3000 + (2000-1500)^2/1500 + (500-1000)^2/1000 = 555.6
- Because chi^2 is greater than 1, and the observed value (4000) is less than the expected value (4500) for (game and video), the two are negatively correlated.

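The chi-square statistic can be recomputed from the 2x2 contingency table. The counts are an assumption consistent with the slide's observed 4000 vs. expected 4500 and the total of 555.6 (10,000 transactions; 6,000 games; 7,500 videos; 4,000 both):

```python
# Chi-square over the games/videos 2x2 contingency table.
n_total, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

observed = {
    ("game", "video"): n_both,
    ("game", "no_video"): n_game - n_both,
    ("no_game", "video"): n_video - n_both,
    ("no_game", "no_video"): n_total - n_game - n_video + n_both,
}
row = {"game": n_game, "no_game": n_total - n_game}
col = {"video": n_video, "no_video": n_total - n_video}

# Expected cell count under independence: row_total * col_total / n_total.
chi2 = sum(
    (obs - row[r] * col[c] / n_total) ** 2 / (row[r] * col[c] / n_total)
    for (r, c), obs in observed.items()
)
print(round(chi2, 1))  # 555.6
```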
Correlation Analysis: All-Confidence
- all_conf(X) = sup(X) / max_item_sup(X), where max_item_sup(X) is the maximum single-item support among the items in X.
- all_conf(X) is the minimal confidence among the rules i_j => X - {i_j}, where i_j is an item in X.

Cosine Measure
- cosine(A, B) = P(A U B) / sqrt(P(A) x P(B))
- Similar to lift, but influenced only by the supports of A, B, and A U B, not by the total number of transactions.

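Both measures can be computed for the games/videos example, using assumed contingency counts consistent with the slide's lift value of 0.89 (10,000 transactions; 6,000 game buyers; 7,500 video buyers; 4,000 buying both):

```python
import math

# All-confidence and cosine for the games/videos data.
n_total, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

sup_both = n_both / n_total                   # sup({game, video}) = 0.40
max_item_sup = max(n_game, n_video) / n_total  # max single-item support

all_conf = sup_both / max_item_sup
cosine = sup_both / math.sqrt((n_game / n_total) * (n_video / n_total))

print(round(all_conf, 3))  # 0.533
print(round(cosine, 3))    # 0.596
```

Note that the cosine value is unchanged if every count is scaled by the same factor, illustrating its independence from the total number of transactions.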
Comparison of Correlation Measures
- All-confidence and the cosine measure are null-invariant.
- These can be used first, followed by lift etc.

Correlation Analysis: All-Confidence Pruning
- If a pattern is all-confident (meets the all_confidence threshold), then all of its sub-patterns are also all-confident.
- This leads to pruning of patterns that don't meet the all_confidence threshold.

Correlation Rules
- Reduce the number of rules; meaningful rules are discovered.
- A combination of measures can be used.

Constraint-Based Data Mining
- Finding all the patterns in a database autonomously is unrealistic: the patterns could be too many but not focused!
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface).
- Constraint-based mining:
  - User flexibility: the user provides constraints on what is to be mined.
  - System optimization: the system exploits the constraints for efficient mining.

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint, using SQL-like queries: e.g., find product pairs sold together in stores in Vancouver in Dec. '00.
- Dimension/level constraint: in relevance to region, price, brand, customer category.
- Rule (or pattern) constraint: e.g., small sales (price < $10) triggers big sales (sum > $200).
- Interestingness constraint: e.g., strong rules with min_support >= 3%, min_confidence >= 60%.

Meta-Rule-Guided Mining
- Makes the mining process more effective and efficient: users can specify the syntactic form of the rules to be mined.
- Example: P1(X, Y) ^ P2(X, W) => buys(X, "Educational Software")
- Rule form: P1 ^ P2 ^ ... ^ Pl => Q1 ^ Q2 ^ ... ^ Qr, with p = l + r predicates in total.
- Find all frequent p-predicate sets, and the support counts of the l-predicate sets needed to compute rule confidence.
- Cube search: a data cube is well suited to computing these counts.

Rule Constraints: Example
- Find the sales of which cheap items (where the sum of the prices is less than $100) may promote the sales of which expensive items (where the minimum price is $500) of the same group, for Chicago customers in 2004:

  mine associations as
    lives_in(C, _, "Chicago") ^ sales+(C, ?{I}, {S}) => sales+(C, ?{J}, {T})
  from sales
  where S.year = 2004 and T.year = 2004 and I.group = J.group
  group by C, I.group
  having sum(I.price) < 100 and min(J.price) >= 500
  with support threshold = 1%
  with confidence threshold = 1%

Rule Constraints
- Looks for rules of the form:
  lives_in(C, _, "Chicago") ^ sales(C, ?I1, S1) ^ ... ^ sales(C, ?Ik, Sk) => sales(C, ?J1, T1) ^ ... ^ sales(C, ?Jm, Tm)
  with I = {I1, ..., Ik}, S = {S1, ..., Sk}, J = {J1, ..., Jm}, T = {T1, ..., Tm}
- Mines rules like:
  lives_in(C, _, "Chicago") ^ sales(C, "CD", _) ^ sales(C, "MS/Office", _) => sales(C, "MS/SQLServer", _) [1.5%, 68%]

Types of Constraints
- Anti-monotone: if an itemset does not satisfy the constraint, none of its supersets can satisfy it either. Ex: sum(I.price) <= $500 (assuming non-negative prices).
- Convertible: arranging the items in a suitable order may convert the constraint into an anti-monotone or monotone one. Ex: avg(I.price) <= v, with items processed in price-ascending order.
- Inconvertible: no such ordering exists. Ex: sum(S) <= v, where the elements of S may take any value (positive or negative).
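A minimal sketch of how an anti-monotone constraint prunes the search: once an itemset violates sum(price) <= 500, the whole subtree of its supersets is skipped without being generated. Item names and prices are hypothetical:

```python
# Anti-monotone constraint pruning during depth-first candidate growth.
prices = {"a": 100, "b": 250, "c": 300, "d": 50}  # hypothetical prices
LIMIT = 500

def satisfies(itemset):
    # Anti-monotone with non-negative prices: adding items never helps.
    return sum(prices[i] for i in itemset) <= LIMIT

def grow(itemset, candidates):
    """Extend itemset one item at a time, pruning on the constraint."""
    results = []
    for item in candidates:
        if item <= max(itemset, default=""):  # enumerate in lexicographic order
            continue
        ext = itemset | {item}
        if satisfies(ext):
            results.append(ext)
            results.extend(grow(ext, candidates))
        # else: no superset of ext can satisfy the constraint -> skip subtree
    return results

sets = grow(set(), set(prices))
# e.g. {a, b, c} (price sum 650) is never generated, but {a, c, d} (450) is.
print(sorted("".join(sorted(s)) for s in sets))
```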