
Transcript of Data Mining-Association Mining_3.pdf

  • ASSOCIATION RULE MINING

    Mining Frequent Itemsets Using the Vertical Data Format: horizontal data format vs. vertical data format.

    Vertical data format: transform the data from horizontal format (TID → itemset) to vertical format (item → TID_set). The support count of an itemset is the length of its TID_set. Intersect the TID_sets of every pair of frequent single items. Starting with k = 1, the frequent k-itemsets are used to construct the candidate (k+1)-itemsets; the Apriori property is exploited, and there is no need to scan the database to find the support of the (k+1)-itemsets. To avoid long TID_sets, keep track of only the differences between the TID_sets of a (k+1)-itemset and the corresponding k-itemset. Ex: diffset({I1, I2}, {I1}) = {T500, T700}
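
    A minimal sketch of vertical-format mining along these lines; the toy database and min_sup = 2 are assumptions, chosen so that the diffset example above comes out the same:

      from itertools import combinations

      # Toy database in horizontal format: TID -> items (assumed data).
      horizontal = {
          "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"},
          "T300": {"I2", "I3"},       "T400": {"I1", "I2", "I4"},
          "T500": {"I1", "I3"},       "T600": {"I2", "I3"},
          "T700": {"I1", "I3"},       "T800": {"I1", "I2", "I3", "I5"},
          "T900": {"I1", "I2", "I3"},
      }
      min_sup = 2

      # Transform to vertical format: item -> TID_set.
      vertical = {}
      for tid, items in horizontal.items():
          for item in items:
              vertical.setdefault(item, set()).add(tid)

      # Frequent 1-itemsets: the support count is the length of the TID_set.
      freq = {frozenset([i]): t for i, t in vertical.items() if len(t) >= min_sup}

      # Build (k+1)-itemsets by intersecting TID_sets of frequent k-itemsets;
      # the Apriori property holds and the database is never rescanned.
      level, k = dict(freq), 1
      while level:
          next_level = {}
          for a, b in combinations(level, 2):
              cand = a | b
              if len(cand) == k + 1:          # a and b share a (k-1)-prefix
                  tids = level[a] & level[b]
                  if len(tids) >= min_sup:
                      next_level[cand] = tids
          freq.update(next_level)
          level, k = next_level, k + 1

      # Diffset: TIDs in t({I1}) that are missing from t({I1, I2}).
      print(sorted(vertical["I1"] - vertical["I2"]))  # ['T500', 'T700']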

  • Maximal and Closed Frequent Itemsets & Mining Closed Frequent Itemsets

    Simple approach: mine the complete set of frequent itemsets, then eliminate every frequent itemset that is a proper subset of some other frequent itemset with the same support. Better: directly mine the closed frequent itemsets, using effective pruning strategies.
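
    A minimal sketch of the simple post-processing approach, assuming the complete set of frequent itemsets and their support counts is already in hand:

      def closed_itemsets(freq):
          """Keep X unless it is a proper subset (<) of another frequent
          itemset with the same support; freq: frozenset -> support count."""
          return {x: s for x, s in freq.items()
                  if not any(x < y and s == t for y, t in freq.items())}

      def maximal_itemsets(freq):
          """Frequent itemsets that have no frequent proper superset."""
          return {x: s for x, s in freq.items() if not any(x < y for y in freq)}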

  • Mining Closed Frequent Itemsets: Item Merging

    If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y. Ex: the projected database for {I5}: 2 is {{I2, I1}, {I2, I1, I3}}. Every transaction in it contains the itemset {I2, I1}, but no proper superset of {I2, I1} occurs in every transaction, so {I2, I1} can be merged with {I5} to give {I5, I2, I1}: 2. There is no need to mine for closed itemsets that contain I5 but not {I2, I1}.

  • Mining Closed Frequent Itemsets: Sub-Itemset Pruning

    If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set-enumeration tree cannot be frequent closed itemsets and can thus be pruned. Ex: with min_sup = 2, projection on a1 gives {a1, a2, …, a50}: 2 by itemset merging. Since support({a2}) = support({a1, a2, …, a50}) = 2 and {a2} is a proper subset, there is no need to examine a2 and its projections.

  • Mining Closed Frequent Itemsets: Item Skipping

    In depth-first mining of closed itemsets, each prefix itemset X is associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, p can safely be pruned from the header tables at the higher levels. Ex: a2 has the same support in the global header table and in a1's projection, so a2 can be pruned from the global header table.

  • Mining Closed Frequent Itemsets: Closure Checking

    Check whether a newly found itemset is a superset or subset of an already found closed frequent itemset with the same support. Superset checking is handled by item merging. For subset checking, a pattern tree (similar to the FP-tree) maintains the set of closed itemsets mined so far. A candidate Sc is subsumed by an already found closed itemset Sa iff: both have the same support, the length of Sc is smaller than that of Sa, and all items in Sc are contained in Sa.
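
    A minimal sketch of that subsumption test, leaving out the pattern-tree indexing that makes it fast in practice:

      def is_subsumed(sc, sup_c, closed_so_far):
          """True iff some already-found closed itemset Sa has the same
          support as Sc, is longer than Sc, and contains all of Sc's items.
          closed_so_far: dict mapping frozenset -> support count."""
          return any(sup_c == sup_a and len(sc) < len(sa) and sc <= sa
                     for sa, sup_a in closed_so_far.items())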

  • Multilevel Association Rules

    Rules generated by association rule mining over concept hierarchies. The level of the hierarchy matters: rules that are too general amount to common-sense knowledge, and some rules can only be discovered at the higher levels.

    Uniform support: the same minimum support threshold at all levels. Reduced support: a reduced minimum support threshold at the lower levels.

    Using reduced support: Level-by-level independent - full-breadth search; no background knowledge is used for pruning. Level-cross filtering by single item - an item at the ith level is examined iff its parent node at the (i-1)st level is frequent (see the sketch after this list).

    Level-cross filtering by k-itemset - a k-itemset at the ith level is examined iff its corresponding parent k-itemset at the (i-1)st level is frequent. Group-based support - different minimum support thresholds for different groups of items.
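
    A minimal sketch of level-cross filtering by single item, assuming a parent map that encodes the concept hierarchy (the hierarchy and the frequent set here are illustrative):

      def items_to_examine(level_items, parent, frequent_at_parent_level):
          """Examine an item at level i only if its parent at level i-1
          passed that level's minimum support threshold."""
          return [it for it in level_items
                  if parent[it] in frequent_at_parent_level]

      parent = {"2% milk": "milk", "skim milk": "milk", "wheat bread": "bread"}
      print(items_to_examine(["2% milk", "skim milk", "wheat bread"],
                             parent, frequent_at_parent_level={"milk"}))
      # -> ['2% milk', 'skim milk']  ('bread' was not frequent, so its
      #    children are never examined)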

  • Redundant Multilevel Association Rules: Filtering

    Some rules may be redundant due to ancestor relationships between items. Ex: milk ⇒ wheat bread [8%, 70%] and 2% milk ⇒ wheat bread [2%, 72%]. The first rule is an ancestor of the second. A rule is redundant if its support and confidence are close to their expected values based on the rule's ancestor: if, say, about one quarter of milk purchases are 2% milk, the expected support of the second rule is 8% × 1/4 = 2% with confidence near 70%, so the second rule adds no new information.
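
    A minimal sketch of that check; the 25% share of 2% milk and the tolerances are assumptions used for illustration:

      def is_redundant(rule_sup, rule_conf, anc_sup, anc_conf, share,
                       tol_sup=0.005, tol_conf=0.05):
          """Redundant if support and confidence are close to the values
          expected from the ancestor rule; share is the fraction of the
          ancestor item's transactions covered by the descendant item."""
          return (abs(rule_sup - anc_sup * share) <= tol_sup and
                  abs(rule_conf - anc_conf) <= tol_conf)

      # milk => wheat bread [8%, 70%] vs. 2% milk => wheat bread [2%, 72%]
      print(is_redundant(0.02, 0.72, 0.08, 0.70, share=0.25))  # True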

  • Multidimensional Association Rules

    Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread"). Multidimensional rules (two or more dimensions/predicates): Inter-dimension association rules (no repeated predicates), e.g. age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke"); hybrid-dimension association rules (repeated predicates), e.g. age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke").

  • Categorical Attributes and Quantitative Attributes

    Categorical attributes: a finite number of possible values, with no ordering among the values. Quantitative attributes: numeric, with an implicit ordering among the values.

  • Mining Quantitative Associations

    Static discretization, based on predefined concept hierarchies. Dynamic discretization, based on the data distribution. Clustering: distance-based association.

  • Static Discretization of Quantitative Attributes

    Attributes are discretized prior to mining using a concept hierarchy; numeric values are replaced by ranges. In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans. A data cube is well suited for this mining (faster): fully materialized cubes may already exist, and Apriori prunes the search.

  • Quantitative Association Rules

    Numeric attributes are discretized dynamically, so that the confidence of the mined rules is maximized; the rules have the form Aquan1 ∧ Aquan2 ⇒ Acat. ARCS (Association Rules Clustering System) clusters adjacent association rules to form more general rules, using a 2-D grid.

    The ARCS steps: binning (equiwidth, equidepth, or homogeneity-based, filling a 2-D array), finding the frequent predicate sets, and clustering the association rules; the grid-based technique produces rectangular regions.

    Clustering association rules, example:
    age(X, 34) ∧ income(X, "30-40K") ⇒ buys(X, "high resolution TV")
    age(X, 35) ∧ income(X, "30-40K") ⇒ buys(X, "high resolution TV")
    age(X, 34) ∧ income(X, "40-50K") ⇒ buys(X, "high resolution TV")
    age(X, 35) ∧ income(X, "40-50K") ⇒ buys(X, "high resolution TV")
    These cluster into the single rule age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV").
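
    A minimal sketch of the grid step behind that merge, using the bins and values from the example:

      # Grid cells (age, income bin) where the rule's right-hand side held.
      occupied = {(34, "30-40K"), (35, "30-40K"), (34, "40-50K"), (35, "40-50K")}
      ages, bins = [34, 35], ["30-40K", "40-50K"]

      # All four cells of the 2x2 rectangle are occupied, so the four rules
      # collapse into one rule covering the whole rectangle.
      if all((a, b) in occupied for a in ages for b in bins):
          print('age(X, "34-35") and income(X, "30-50K")'
                ' => buys(X, "high resolution TV")')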

  • Strong vs. Interesting

    The user is the final judge. A strong rule can be misleading, e.g. buys(X, "computer games") ⇒ buys(X, "videos") [40%, 66.7%]. Other measures may be needed.

    Correlation analysis: the correlation between itemsets A and B is calculated. A is independent of B if P(A ∪ B) = P(A) P(B); otherwise A and B are correlated.

  • Correlation Analysis: Lift

    lift(A, B) = P(A ∪ B) / (P(A) · P(B)). Value < 1: A is negatively correlated with B; value > 1: A and B are positively correlated; value = 1: independent. Equivalently, lift = conf(A ⇒ B) / sup(B).

    Ex: lift = P(Game and Video) / (P(Game) × P(Video)) = 0.89, a negative correlation. The rule buys(X, "computer games") ⇒ ¬buys(X, "videos") is more accurate, although it has lower support and confidence.

    χ² analysis: χ² = Σ (observed − expected)² / expected = (4000 − 4500)²/4500 + … = 555.6. Because χ² is greater than 1 and the observed count for (Game and Video) is less than the expected count, Game and Video are negatively correlated.
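
    A minimal sketch reproducing these numbers; the full contingency counts (10,000 transactions, 6,000 game buyers, 7,500 video buyers, 4,000 buying both) are an assumption, chosen to be consistent with the 0.89 lift and 555.6 χ² quoted above:

      n = 10_000      # total transactions (assumed)
      game = 6_000    # buy computer games (assumed)
      video = 7_500   # buy videos (assumed)
      both = 4_000    # buy both (observed cell from the slide)

      # Lift: observed joint probability over the product of the marginals.
      lift = (both / n) / ((game / n) * (video / n))
      print(round(lift, 2))  # 0.89 -> negative correlation (< 1)

      # Chi-square summed over all four cells of the 2x2 contingency table.
      observed = {
          (True, True): both,
          (True, False): game - both,
          (False, True): video - both,
          (False, False): n - game - video + both,
      }
      chi2 = 0.0
      for (g, v), obs in observed.items():
          pg = game / n if g else 1 - game / n
          pv = video / n if v else 1 - video / n
          exp = n * pg * pv
          chi2 += (obs - exp) ** 2 / exp
      print(round(chi2, 1))  # 555.6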

  • Correlation Analysis: All-Confidence and Cosine

    all_conf(X) = sup(X) / max_item_sup(X), where max_item_sup(X) is the maximum support of any single item in X. It equals the minimal confidence among the rules i_j ⇒ X − {i_j}, where i_j ∈ X.

    Cosine measure: cosine(A, B) = P(A ∪ B) / (P(A) × P(B))^(1/2). Similar to lift, but influenced only by the supports of A, B, and A ∪ B, not by the total number of transactions.
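
    Minimal sketches of both measures over supports expressed as probabilities, reusing the assumed game/video counts from above:

      from math import sqrt

      def all_conf(sup_x, item_sups):
          """all_conf(X) = sup(X) / max item support in X; equals the minimum
          confidence of the rules i_j => X - {i_j} over items i_j in X."""
          return sup_x / max(item_sups)

      def cosine(sup_ab, sup_a, sup_b):
          """cosine(A, B) = P(A u B) / sqrt(P(A) * P(B)); like lift, but the
          square root removes the influence of the total transaction count."""
          return sup_ab / sqrt(sup_a * sup_b)

      print(round(all_conf(0.40, [0.60, 0.75]), 2))  # 0.53
      print(round(cosine(0.40, 0.60, 0.75), 2))      # 0.6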

  • Comparison of Correlation Measures

    All-confidence and the cosine measure are null-invariant; one can apply them first, followed by lift and the other measures.

    All-confidence: if a pattern meets the all_confidence threshold, then all of its sub-patterns do too; this downward-closure property allows pruning of patterns that don't meet the threshold.

    Correlation rules reduce the number of rules and surface meaningful ones; a combination of measures can be used.

  • Constraint-based Data Mining

    Finding all the patterns in a database autonomously is unrealistic: the patterns could be too many, and not focused. Data mining should be an interactive process in which the user directs what is to be mined, using a data mining query language (or a graphical user interface). Constraint-based mining offers user flexibility: the user provides constraints on what is to be mined; and system optimization: the system exploits those constraints for efficient mining.

  • Constraints in Data Mining

    Knowledge type constraint: classification, association, etc. Data constraint, using SQL-like queries: e.g., find product pairs sold together in stores in Vancouver in Dec. '00. Dimension/level constraint: in relevance to region, price, brand, customer category. Rule (or pattern) constraint: e.g., small sales (price < $10) trigger big sales (sum > $200). Interestingness constraint: e.g., strong rules with min_support ≥ 3% and min_confidence ≥ 60%.

  • Meta Rule guided Mining

    Makes the mining process more effective and efficient: users can specify the syntactic form of the rules. Example: P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "educational software"). The general rule form is P1 ∧ P2 ∧ … ∧ Pl ⇒ Q1 ∧ Q2 ∧ … ∧ Qr, with p = l + r predicates. Find all frequent p-predicate sets, along with the counts of their l-predicate subsets (needed to compute confidence). A cube search can support this.

  • Rule Constraints

    Find the sales of which cheap items (where the sum of the prices is less than $100) may promote the sales of which expensive items (where the minimum price is $500) of the same group, for Chicago customers in 2004:

    mine associations as
      lives_in(C, _, "Chicago") ∧ sales+(C, ?{I}, {S}) ⇒ sales+(C, ?{J}, {T})
    from sales
    where S.year = 2004 and T.year = 2004 and I.group = J.group
    group by C, I.group
    having sum(I.price) < 100 and min(J.price) >= 500
    with support threshold = 1%
    with confidence threshold = 1%

    This looks for rules of the form:
    lives_in(C, _, "Chicago") ∧ sales(C, ?I1, S1) ∧ … ∧ sales(C, ?Ik, Sk) ⇒ sales(C, ?J1, T1) ∧ … ∧ sales(C, ?Jm, Tm)
    where I = {I1, …, Ik}, S = {S1, …, Sk}, J = {J1, …, Jm}, T = {T1, …, Tm}.

    It mines rules like:
    lives_in(C, _, "Chicago") ∧ sales(C, "CD", _) ∧ sales(C, "MS/Office", _) ⇒ sales(C, "MS/SQLServer", _) [1.5%, 68%]

  • Types of Constraints

    Anti-monotone: if an itemset does not satisfy the constraint, none of its supersets can satisfy it either. Ex: sum(I.price) ≤ $500. Convertible: arranging the items in a suitable order may convert the constraint into an anti-monotone one. Ex: avg(I.price) ≤ v behaves anti-monotonically when items are processed in price-ascending order. Inconvertible: no such ordering exists. Ex: sum(S) ≤ v where the items of S may have negative values.
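
    A minimal sketch of exploiting an anti-monotone constraint such as sum(I.price) ≤ $500 during itemset enumeration (the prices are illustrative):

      price = {"I1": 120, "I2": 250, "I3": 400, "I4": 80}  # assumed prices

      def satisfies(itemset, budget=500):
          """Anti-monotone: once sum(price) exceeds the budget, every
          superset exceeds it too, so the whole branch can be pruned."""
          return sum(price[i] for i in itemset) <= budget

      def extend(itemset, remaining):
          """Depth-first enumeration that abandons a subtree as soon as
          the anti-monotone constraint fails."""
          for idx, item in enumerate(remaining):
              cand = itemset | {item}
              if satisfies(cand):  # fails here => skip cand AND its supersets
                  yield cand
                  yield from extend(cand, remaining[idx + 1:])

      for s in extend(frozenset(), ["I1", "I2", "I3", "I4"]):
          print(sorted(s), sum(price[i] for i in s))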
