
Transcript of Data Mining-Association Mining_3.pdf

  • ASSOCIATION RULE MINING

    Mining Frequent Itemsets Using the Vertical Data Format: horizontal data format vs. vertical data format.

    Vertical data format: transform the data from horizontal format (TID → itemset) to vertical format (item → TID_set). The support count of an itemset is the length of its TID_set. Intersect the TID_sets of every pair of frequent single items. Starting with k = 1, the frequent k-itemsets are used to construct the candidate (k+1)-itemsets; the Apriori property is exploited, and there is no need to scan the database to find the support of the (k+1)-itemsets. To avoid long TID_sets, keep track of only the differences between the TID_sets of a (k+1)-itemset and the corresponding k-itemset. Ex: diffset({I1, I2}, {I1}) = {T500, T700}
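
    A minimal sketch of vertical-format mining along these lines; the toy database and min_sup = 2 are assumptions, chosen so that the diffset example above comes out the same:

      from itertools import combinations

      # Toy database in horizontal format: TID -> items (assumed data).
      horizontal = {
          "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"},
          "T300": {"I2", "I3"},       "T400": {"I1", "I2", "I4"},
          "T500": {"I1", "I3"},       "T600": {"I2", "I3"},
          "T700": {"I1", "I3"},       "T800": {"I1", "I2", "I3", "I5"},
          "T900": {"I1", "I2", "I3"},
      }
      min_sup = 2

      # Transform to vertical format: item -> TID_set.
      vertical = {}
      for tid, items in horizontal.items():
          for item in items:
              vertical.setdefault(item, set()).add(tid)

      # Frequent 1-itemsets: the support count is the length of the TID_set.
      freq = {frozenset([i]): t for i, t in vertical.items() if len(t) >= min_sup}

      # Build (k+1)-itemsets by intersecting TID_sets of frequent k-itemsets;
      # the Apriori property holds and the database is never rescanned.
      level, k = dict(freq), 1
      while level:
          next_level = {}
          for a, b in combinations(level, 2):
              cand = a | b
              if len(cand) == k + 1:          # a and b share a (k-1)-prefix
                  tids = level[a] & level[b]
                  if len(tids) >= min_sup:
                      next_level[cand] = tids
          freq.update(next_level)
          level, k = next_level, k + 1

      # Diffset: TIDs in t({I1}) that are missing from t({I1, I2}).
      print(sorted(vertical["I1"] - vertical["I2"]))  # ['T500', 'T700']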

  • Maximal and Closed Frequent Itemsets & Mining Closed Frequent Itemsets

    Simple approach: mine the complete set of frequent itemsets, then eliminate every frequent itemset that is a proper subset of some other frequent itemset with the same support. Better: directly mine the closed frequent itemsets, using effective pruning strategies.
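
    A minimal sketch of the simple post-processing approach, assuming the complete set of frequent itemsets and their support counts is already in hand:

      def closed_itemsets(freq):
          """Keep X unless it is a proper subset (<) of another frequent
          itemset with the same support; freq: frozenset -> support count."""
          return {x: s for x, s in freq.items()
                  if not any(x < y and s == t for y, t in freq.items())}

      def maximal_itemsets(freq):
          """Frequent itemsets that have no frequent proper superset."""
          return {x: s for x, s in freq.items() if not any(x < y for y in freq)}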

  • Mining Closed Frequent Itemsets: Item Merging

    If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y. Ex: the projected database for {I5}: 2 is {{I2, I1}, {I2, I1, I3}}. Every transaction in it contains the itemset {I2, I1}, but no proper superset of {I2, I1} occurs in every transaction, so {I2, I1} can be merged with {I5} to give {I5, I2, I1}: 2. There is no need to mine for closed itemsets that contain I5 but not {I2, I1}.

  • Mining Closed Frequent Itemsets: Sub-Itemset Pruning

    If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set-enumeration tree cannot be frequent closed itemsets and can thus be pruned. Ex: with min_sup = 2, projection on a1 gives {a1, a2, …, a50}: 2 by itemset merging. Since support({a2}) = support({a1, a2, …, a50}) = 2 and {a2} is a proper subset, there is no need to examine a2 and its projections.

  • Mining Closed Frequent Itemsets: Item Skipping

    In depth-first mining of closed itemsets, each prefix itemset X is associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, p can safely be pruned from the header tables at the higher levels. Ex: a2 has the same support in the global header table and in a1's projection, so a2 can be pruned from the global header table.

  • Mining Closed Frequent Itemsets: Closure Checking

    Check whether a newly found itemset is a superset or subset of an already found closed frequent itemset with the same support. Superset checking is handled by item merging. For subset checking, a pattern tree (similar to the FP-tree) maintains the set of closed itemsets mined so far. A candidate Sc is subsumed by an already found closed itemset Sa iff: both have the same support, the length of Sc is smaller than that of Sa, and all items in Sc are contained in Sa.
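
    A minimal sketch of that subsumption test, leaving out the pattern-tree indexing that makes it fast in practice:

      def is_subsumed(sc, sup_c, closed_so_far):
          """True iff some already-found closed itemset Sa has the same
          support as Sc, is longer than Sc, and contains all of Sc's items.
          closed_so_far: dict mapping frozenset -> support count."""
          return any(sup_c == sup_a and len(sc) < len(sa) and sc <= sa
                     for sa, sup_a in closed_so_far.items())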

  • Multilevel Association Rules

    Rules generated by association rule mining over concept hierarchies. The level of the hierarchy matters: rules that are too general amount to common-sense knowledge, and some rules can only be discovered at the higher levels.

    Uniform support: the same minimum support threshold at all levels. Reduced support: a reduced minimum support threshold at the lower levels.

    Using reduced support: Level-by-level independent - full-breadth search; no background knowledge is used for pruning. Level-cross filtering by single item - an item at the ith level is examined iff its parent node at the (i-1)st level is frequent (see the sketch after this list).

    Level-cross filtering by k-itemset - a k-itemset at the ith level is examined iff its corresponding parent k-itemset at the (i-1)st level is frequent. Group-based support - different minimum support thresholds for different groups of items.
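
    A minimal sketch of level-cross filtering by single item, assuming a parent map that encodes the concept hierarchy (the hierarchy and the frequent set here are illustrative):

      def items_to_examine(level_items, parent, frequent_at_parent_level):
          """Examine an item at level i only if its parent at level i-1
          passed that level's minimum support threshold."""
          return [it for it in level_items
                  if parent[it] in frequent_at_parent_level]

      parent = {"2% milk": "milk", "skim milk": "milk", "wheat bread": "bread"}
      print(items_to_examine(["2% milk", "skim milk", "wheat bread"],
                             parent, frequent_at_parent_level={"milk"}))
      # -> ['2% milk', 'skim milk']  ('bread' was not frequent, so its
      #    children are never examined)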

  • Redundant Multilevel Association Rules: Filtering

    Some rules may be redundant due to ancestor relationships between items. Ex: milk ⇒ wheat bread [8%, 70%] and 2% milk ⇒ wheat bread [2%, 72%]. The first rule is an ancestor of the second. A rule is redundant if its support and confidence are close to their expected values based on the rule's ancestor: if, say, about one quarter of milk purchases are 2% milk, the expected support of the second rule is 8% × 1/4 = 2% with confidence near 70%, so the second rule adds no new information.
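
    A minimal sketch of that check; the 25% share of 2% milk and the tolerances are assumptions used for illustration:

      def is_redundant(rule_sup, rule_conf, anc_sup, anc_conf, share,
                       tol_sup=0.005, tol_conf=0.05):
          """Redundant if support and confidence are close to the values
          expected from the ancestor rule; share is the fraction of the
          ancestor item's transactions covered by the descendant item."""
          return (abs(rule_sup - anc_sup * share) <= tol_sup and
                  abs(rule_conf - anc_conf) <= tol_conf)

      # milk => wheat bread [8%, 70%] vs. 2% milk => wheat bread [2%, 72%]
      print(is_redundant(0.02, 0.72, 0.08, 0.70, share=0.25))  # True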

  • Multidimensional Association Rules

    Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread"). Multidimensional rules (two or more dimensions/predicates): Inter-dimension association rules (no repeated predicates), e.g. age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke"); hybrid-dimension association rules (repeated predicates), e.g. age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke").

  • Categorical Attributes and Quantitative Attributes

    Categorical attributes: a finite number of possible values, with no ordering among the values. Quantitative attributes: numeric, with an implicit ordering among the values.

  • Mining Quantitative Associations

    Static discretization, based on predefined concept hierarchies. Dynamic discretization, based on the data distribution. Clustering: distance-based association.

  • Static Discretization of Quantitative Attributes

    Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges. In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans. A data cube is well suited for this mining (faster): fully materialized cubes may already exist, and Apriori prunes the search.

  • Quantitative Association Rules

    Numeric attributes are discretized dynamically, so that the confidence of the mined rules is maximized; the rules have the form Aquan1 ∧ Aquan2 ⇒ Acat. ARCS (Association Rules Clustering System) clusters adjacent association rules to form more general rules, using a 2-D grid.

    The ARCS steps: binning (equiwidth, equidepth, or homogeneity-based, filling a 2-D array), finding the frequent predicate sets, and clustering the association rules; the grid-based technique produces rectangular regions.

    Clustering association rules, example:
    age(X, 34) ∧ income(X, "30-40K") ⇒ buys(X, "high resolution TV")
    age(X, 35) ∧ income(X, "30-40K") ⇒ buys(X, "high resolution TV")
    age(X, 34) ∧ income(X, "40-50K") ⇒ buys(X, "high resolution TV")
    age(X, 35) ∧ income(X, "40-50K") ⇒ buys(X, "high resolution TV")
    These cluster into the single rule age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV").
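
    A minimal sketch of the grid step behind that merge, using the bins and values from the example:

      # Grid cells (age, income bin) where the rule's right-hand side held.
      occupied = {(34, "30-40K"), (35, "30-40K"), (34, "40-50K"), (35, "40-50K")}
      ages, bins = [34, 35], ["30-40K", "40-50K"]

      # All four cells of the 2x2 rectangle are occupied, so the four rules
      # collapse into one rule covering the whole rectangle.
      if all((a, b) in occupied for a in ages for b in bins):
          print('age(X, "34-35") and income(X, "30-50K")'
                ' => buys(X, "high resolution TV")')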

  • Strong vs. Interesting

    The user is the final judge. A strong rule can be misleading, e.g. buys(X, "computer games") ⇒ buys(X, "videos") [40%, 66.7%]. Other measures may be needed.

    Correlation analysis: the correlation between itemsets A and B is calculated. A is independent of B if P(A ∪ B) = P(A) P(B); otherwise A and B are correlated.

  • Correlation Analysis: Lift

    lift(A, B) = P(A ∪ B) / (P(A) · P(B)). Value < 1: A is negatively correlated with B; value > 1: A and B are positively correlated; value = 1: independent. Equivalently, lift = conf(A ⇒ B) / sup(B).

    Ex: lift = P(Game and Video) / (P(Game) × P(Video)) = 0.89, a negative correlation. The rule buys(X, "computer games") ⇒ ¬buys(X, "videos") is more accurate, although it has lower support and confidence.

    χ² analysis: χ² = Σ (observed − expected)² / expected = (4000 − 4500)²/4500 + … = 555.6. Because χ² is greater than 1 and the observed count for (Game and Video) is less than the expected count, Game and Video are negatively correlated.
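
    A minimal sketch reproducing these numbers; the full contingency counts (10,000 transactions, 6,000 game buyers, 7,500 video buyers, 4,000 buying both) are an assumption, chosen to be consistent with the 0.89 lift and 555.6 χ² quoted above:

      n = 10_000      # total transactions (assumed)
      game = 6_000    # buy computer games (assumed)
      video = 7_500   # buy videos (assumed)
      both = 4_000    # buy both (observed cell from the slide)

      # Lift: observed joint probability over the product of the marginals.
      lift = (both / n) / ((game / n) * (video / n))
      print(round(lift, 2))  # 0.89 -> negative correlation (< 1)

      # Chi-square summed over all four cells of the 2x2 contingency table.
      observed = {
          (True, True): both,
          (True, False): game - both,
          (False, True): video - both,
          (False, False): n - game - video + both,
      }
      chi2 = 0.0
      for (g, v), obs in observed.items():
          pg = game / n if g else 1 - game / n
          pv = video / n if v else 1 - video / n
          exp = n * pg * pv
          chi2 += (obs - exp) ** 2 / exp
      print(round(chi2, 1))  # 555.6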

  • Correlation Analysis: All-Confidence and Cosine

    all_conf(X) = sup(X) / max_item_sup(X), where max_item_sup(X) is the maximum support of any single item in X. It equals the minimal confidence among the rules i_j ⇒ X − {i_j}, where i_j ∈ X.

    Cosine measure: cosine(A, B) = P(A ∪ B) / (P(A) × P(B))^(1/2). Similar to lift, but influenced only by the supports of A, B, and A ∪ B, not by the total number of transactions.
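
    Minimal sketches of both measures over supports expressed as probabilities, reusing the assumed game/video counts from above:

      from math import sqrt

      def all_conf(sup_x, item_sups):
          """all_conf(X) = sup(X) / max item support in X; equals the minimum
          confidence of the rules i_j => X - {i_j} over items i_j in X."""
          return sup_x / max(item_sups)

      def cosine(sup_ab, sup_a, sup_b):
          """cosine(A, B) = P(A u B) / sqrt(P(A) * P(B)); like lift, but the
          square root removes the influence of the total transaction count."""
          return sup_ab / sqrt(sup_a * sup_b)

      print(round(all_conf(0.40, [0.60, 0.75]), 2))  # 0.53
      print(round(cosine(0.40, 0.60, 0.75), 2))      # 0.6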

  • Comparison of Correlation Measures

    All-confidence and the cosine measure are null-invariant; one can apply them first, followed by lift and the other measures.

    All-confidence: if a pattern meets the all_confidence threshold, then all of its sub-patterns do too; this downward-closure property allows pruning of patterns that don't meet the threshold.

    Correlation rules reduce the number of rules and surface meaningful ones; a combination of measures can be used.

  • Constraint-based Data Mining

    Finding all the patterns in a database autonomously is unrealistic: the patterns could be too many, and not focused. Data mining should be an interactive process in which the user directs what is to be mined, using a data mining query language (or a graphical user interface). Constraint-based mining offers user flexibility: the user provides constraints on what is to be mined; and system optimization: the system exploits those constraints for efficient mining.

  • Constraints in Data Mining

    Knowledge type constraint: classification, association, etc. Data constraint, using SQL-like queries: e.g., find product pairs sold together in stores in Vancouver in Dec. '00. Dimension/level constraint: in relevance to region, price, brand, customer category. Rule (or pattern) constraint: e.g., small sales (price < $10) trigger big sales (sum > $200). Interestingness constraint: e.g., strong rules with min_support ≥ 3% and min_confidence ≥ 60%.

  • Meta Rule guided Mining

    Makes the mining process more effective and efficient: users can specify the syntactic form of the rules. Example: P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "educational software"). The general rule form is P1 ∧ P2 ∧ … ∧ Pl ⇒ Q1 ∧ Q2 ∧ … ∧ Qr, with p = l + r predicates. Find all frequent p-predicate sets, along with the counts of their l-predicate subsets (needed to compute confidence). A cube search can support this.

  • Rule Constraints

    Find the sales of which cheap items (where the sum of the prices is less than $100) may promote the sales of which expensive items (where the minimum price is $500) of the same group, for Chicago customers in 2004:

    mine associations as
      lives_in(C, _, "Chicago") ∧ sales+(C, ?{I}, {S}) ⇒ sales+(C, ?{J}, {T})
    from sales
    where S.year = 2004 and T.year = 2004 and I.group = J.group
    group by C, I.group
    having sum(I.price) < 100 and min(J.price) >= 500
    with support threshold = 1%
    with confidence threshold = 1%

    This looks for rules of the form:
    lives_in(C, _, "Chicago") ∧ sales(C, ?I1, S1) ∧ … ∧ sales(C, ?Ik, Sk) ⇒ sales(C, ?J1, T1) ∧ … ∧ sales(C, ?Jm, Tm)
    where I = {I1, …, Ik}, S = {S1, …, Sk}, J = {J1, …, Jm}, T = {T1, …, Tm}.

    It mines rules like:
    lives_in(C, _, "Chicago") ∧ sales(C, "CD", _) ∧ sales(C, "MS/Office", _) ⇒ sales(C, "MS/SQLServer", _) [1.5%, 68%]

  • Types of Constraints

    Anti-monotone: if an itemset does not satisfy the constraint, none of its supersets can satisfy it either. Ex: sum(I.price) ≤ $500. Convertible: arranging the items in a suitable order may convert the constraint into an anti-monotone one. Ex: avg(I.price) ≤ v behaves anti-monotonically when items are processed in price-ascending order. Inconvertible: no such ordering exists. Ex: sum(S) ≤ v where the items of S may have negative values.
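
    A minimal sketch of exploiting an anti-monotone constraint such as sum(I.price) ≤ $500 during itemset enumeration (the prices are illustrative):

      price = {"I1": 120, "I2": 250, "I3": 400, "I4": 80}  # assumed prices

      def satisfies(itemset, budget=500):
          """Anti-monotone: once sum(price) exceeds the budget, every
          superset exceeds it too, so the whole branch can be pruned."""
          return sum(price[i] for i in itemset) <= budget

      def extend(itemset, remaining):
          """Depth-first enumeration that abandons a subtree as soon as
          the anti-monotone constraint fails."""
          for idx, item in enumerate(remaining):
              cand = itemset | {item}
              if satisfies(cand):  # fails here => skip cand AND its supersets
                  yield cand
                  yield from extend(cand, remaining[idx + 1:])

      for s in extend(frozenset(), ["I1", "I2", "I3", "I4"]):
          print(sorted(s), sum(price[i] for i in s))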
