Data Mining-Association Mining 2


Transcript of Data Mining-Association Mining 2

  • ASSOCIATION RULE MINING

    Generating Association Rules from Frequent Itemsets
    Strong association rules satisfy both the minimum support and the minimum confidence thresholds.
    Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

    Generating the rules: for each frequent itemset l, generate all non-empty proper subsets of l.
    For every non-empty subset s of l, output the rule s => (l - s) if sup_count(l) / sup_count(s) >= min_conf.

    Example: l = {I1, I2, I5}, confidence threshold = 70%.
    Non-empty proper subsets: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
    I1 ∧ I2 => I5, confidence = 2/4 = 50%
    I1 ∧ I5 => I2, confidence = 2/2 = 100%
    I2 ∧ I5 => I1, confidence = 2/2 = 100%
    I1 => I2 ∧ I5, confidence = 2/6 = 33%
    I2 => I1 ∧ I5, confidence = 2/7 = 29%
    I5 => I1 ∧ I2, confidence = 2/2 = 100%
    Only the three rules with 100% confidence meet the 70% threshold, so only they are output as strong rules.
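    As a minimal sketch, this rule-generation step can be written directly in Python; the support counts below are hard-coded from the example above rather than computed from a database, and the function name is my own:

    from itertools import combinations

    # Support counts quoted in the example (frequent itemset {I1, I2, I5}).
    support_count = {
        frozenset({'I1', 'I2', 'I5'}): 2,
        frozenset({'I1', 'I2'}): 4,
        frozenset({'I1', 'I5'}): 2,
        frozenset({'I2', 'I5'}): 2,
        frozenset({'I1'}): 6,
        frozenset({'I2'}): 7,
        frozenset({'I5'}): 2,
    }

    def rules_from_itemset(l, min_conf):
        # For every non-empty proper subset s of l, output s => (l - s)
        # when sup_count(l) / sup_count(s) >= min_conf.
        l = frozenset(l)
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(sorted(l), size)):
                conf = support_count[l] / support_count[s]
                verdict = 'strong' if conf >= min_conf else 'rejected'
                print(f"{sorted(s)} => {sorted(l - s)}: {conf:.0%} ({verdict})")

    rules_from_itemset({'I1', 'I2', 'I5'}, min_conf=0.70)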

  • Improving the Efficiency of Apriori

    Hash-based technique.
    Transaction reduction: a transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped from later scans (see the sketch below).
    Partitioning.
    Sampling.
    Dynamic itemset counting: new candidate itemsets are added at "start points" during a scan.
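    A rough illustration of transaction reduction, assuming the frequent k-itemsets of the current pass are already known (the toy data and names below are made up):

    from itertools import combinations

    def reduce_transactions(transactions, frequent_k, k):
        # Keep only transactions containing at least one frequent k-itemset;
        # the rest cannot contribute any frequent (k+1)-itemset and can be
        # skipped in all later scans.
        return [t for t in transactions
                if any(frozenset(c) in frequent_k
                       for c in combinations(sorted(t), k))]

    L2 = {frozenset({'I1', 'I2'}), frozenset({'I2', 'I3'})}     # after pass k = 2
    db = [{'I1', 'I2', 'I3'}, {'I1', 'I2'}, {'I4', 'I5'}]
    print(reduce_transactions(db, L2, k=2))                     # {'I4', 'I5'} is dropped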

  • Hash-Based Technique

  • Partitioning: Scan the Database Only Twice

    Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
    Scan 1: partition the database and find the local frequent patterns of each partition.
    Scan 2: consolidate the global frequent patterns by counting the candidates over the full database.
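    A sketch of the two-scan partitioning idea; mine_local stands in for any in-memory miner such as Apriori and is an assumed parameter, not part of the original slides:

    def partition_mine(db, n_parts, min_sup_ratio, mine_local):
        # Scan 1: mine each partition with the same relative threshold.
        # A globally frequent itemset must be locally frequent in at least
        # one partition, so the union of local results is a complete
        # candidate set (false positives possible, false negatives not).
        size = max(1, (len(db) + n_parts - 1) // n_parts)
        candidates = set()
        for i in range(0, len(db), size):
            candidates |= mine_local(db[i:i + size], min_sup_ratio)
        # Scan 2: count every candidate over the whole database.
        min_count = min_sup_ratio * len(db)
        return {c for c in candidates
                if sum(1 for t in db if c <= t) >= min_count}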

  • Sampling for Frequent Patterns

    Select a sample of the original database and mine the frequent patterns within the sample using Apriori; a lower support threshold can be used so that few frequent itemsets are missed.
    Scan the database once to verify the frequent itemsets found in the sample.
    Scan the database again to find the frequent patterns that the sample missed.
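    A matching sketch of the sampling approach; mine_local is again an assumed stand-in for Apriori, and the second scan for missed patterns is only indicated in a comment:

    import random

    def sample_mine(db, min_sup_ratio, mine_local, sample_frac=0.1):
        # Mine a random sample at a lowered threshold so that itemsets
        # frequent in the full database are unlikely to be missed.
        sample = random.sample(db, max(1, int(sample_frac * len(db))))
        candidates = mine_local(sample, 0.8 * min_sup_ratio)
        # One full scan verifies which candidates are truly frequent.
        min_count = min_sup_ratio * len(db)
        verified = {c for c in candidates
                    if sum(1 for t in db if c <= t) >= min_count}
        # A second full scan (not shown) would look for frequent itemsets
        # that the sample missed entirely.
        return verified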

  • Bottleneck of Frequent-Pattern Mining

    Multiple database scans are costly.
    Mining long patterns needs many passes of scanning and generates lots of candidates.
    To find the frequent itemset i1 i2 ... i100: number of scans = 100; number of candidates = 2^100 - 1, about 1.27 × 10^30.
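    The candidate count quoted above is easy to check in Python:

    n = 2**100 - 1              # all non-empty subsets of a 100-itemset
    print(n)                    # 1267650600228229401496703205375
    print(format(n, '.3e'))     # 1.268e+30, i.e. about 1.27 * 10^30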

    The bottleneck is the candidate-generation-and-test paradigm; the remedy is to avoid candidate generation altogether.

  • Mining Frequent Patterns Without Candidate Generation: FP-Growth

    FP-Growth is a divide-and-conquer technique built on the FP-tree: it grows long patterns from short ones using only the local frequent items.
    FP-tree from a Transaction Database - Example (figure not reproduced here).
    FP-Growth, for each frequent length-1 pattern (the suffix pattern):
    constructs its conditional pattern base, the sub-database consisting of the prefix paths that co-occur with the suffix;
    constructs the conditional FP-tree and mines it recursively;
    generates all combinations of frequent patterns by combining the mined patterns with the suffix.
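    As an assumed illustration (the transactions and the global item order below are not given on the slides; they are only chosen to be consistent with the support counts in the earlier example), the conditional pattern base of the suffix I5 is the set of prefix paths of the transactions containing I5:

    # Two transactions containing I5, items listed in an assumed global
    # support-descending order (I2, I1, I3, I4).
    i5_transactions = [{'I1', 'I2', 'I5'}, {'I1', 'I2', 'I3', 'I5'}]
    order = ['I2', 'I1', 'I3', 'I4']
    for t in i5_transactions:
        prefix_path = [i for i in order if i in t]
        print(prefix_path, ': 1')    # -> ['I2', 'I1'] : 1 and ['I2', 'I1', 'I3'] : 1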

    FP-Growth Algorithm
    Input: a transaction database D and min_sup. Output: the complete set of frequent patterns.
    Construction of the FP-tree:
    Scan the database once, collect the frequent items F, and sort them in descending order of support.
    Create the root of the FP-tree, labeled null.
    For each transaction, sort its frequent items in descending order of support as [p|P] and call Insert_tree([p|P], T):
    if T has a child N with N.item = p, increment N's count;
    otherwise create a new node N with count 1 and set its parent link and node links.
    If P is non-empty, call Insert_tree(P, N) recursively.
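    A compact Python sketch of this construction, assuming transactions are given as sets of item names (the class and function names are my own):

    from collections import Counter

    class FPNode:
        # One FP-tree node: item name, count, parent link, children, and a
        # node link chaining the nodes that carry the same item.
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 1, parent
            self.children, self.node_link = {}, None

    def insert_tree(items, T, header):
        if not items:
            return
        p, P = items[0], items[1:]
        if p in T.children:                 # shared prefix: increment count
            N = T.children[p]
            N.count += 1
        else:                               # new branch with count 1
            N = FPNode(p, T)
            T.children[p] = N
            N.node_link = header.get(p)     # prepend N to p's node-link chain
            header[p] = N
        insert_tree(P, N, header)           # recurse on the remaining items

    def build_fp_tree(db, min_sup):
        counts = Counter(i for t in db for i in t)                  # scan 1
        rank = {i: r for r, i in enumerate(
            sorted((i for i in counts if counts[i] >= min_sup),
                   key=counts.get, reverse=True))}
        root, header = FPNode(None, None), {}
        for t in db:                                                # scan 2
            insert_tree(sorted((i for i in t if i in rank), key=rank.get),
                        root, header)
        return root, header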

  • Algorithm

    Procedure FP_growth(Tree, α):
    if Tree contains a single path P, then for each combination β of the nodes in P, generate the pattern β ∪ α with support = the minimum support count of the nodes in β;
    else, for each item xi in the header table of Tree {
    generate the pattern β = xi ∪ α with support = xi.support;
    construct β's conditional pattern base and then β's conditional FP-tree Tree_β;
    if Tree_β is not empty, call FP_growth(Tree_β, β) }.
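    The sketch below runs the same recursion on a flattened representation: instead of a compressed tree, it passes the conditional pattern base around directly as (prefix-path, count) pairs, which is simpler to show but less memory-efficient than a real conditional FP-tree. The toy database is assumed; it is chosen to be consistent with the support counts quoted in the earlier example.

    from collections import Counter

    def fp_growth(base, min_sup, suffix=frozenset()):
        # base: a conditional pattern base as (itemset, count) pairs.
        counts = Counter()
        for items, n in base:
            for i in items:
                counts[i] += n
        # Visit items in support-ascending order; each frequent pattern is
        # then generated exactly once, suffix-extended as in the procedure.
        freq = [i for i in sorted(counts, key=counts.get) if counts[i] >= min_sup]
        rank = {i: r for r, i in enumerate(freq)}
        for item in freq:
            beta = suffix | {item}
            yield beta, counts[item]
            # beta's conditional pattern base: the prefix paths (strictly
            # more frequent items) of every entry containing `item`.
            cond = [(frozenset(i for i in items if rank.get(i, -1) > rank[item]), n)
                    for items, n in base if item in items]
            cond = [(p, n) for p, n in cond if p]
            if cond:
                yield from fp_growth(cond, min_sup, beta)

    db = [{'I1','I2','I5'}, {'I2','I4'}, {'I2','I3'}, {'I1','I2','I4'},
          {'I1','I3'}, {'I2','I3'}, {'I1','I3'}, {'I1','I2','I3','I5'},
          {'I1','I2','I3'}]
    for pattern, sup in fp_growth([(frozenset(t), 1) for t in db], min_sup=2):
        print(sorted(pattern), sup)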

    Features
    Finds long frequent patterns by recursively looking for shorter ones, then concatenating the suffix.
    Items are kept in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared, so the tree stays compact.
    The FP-tree is main-memory based.
    Efficient and scalable; faster than Apriori.
