Data Mining-Association Mining 2


Transcript of Data Mining-Association Mining 2

  • ASSOCIATION RULE MINING

    Generating Association Rules from Frequent Itemsets
    Strong association rules satisfy both the minimum support and the minimum confidence thresholds.
    Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

    Generating the rules: for each frequent itemset l, generate all non-empty proper subsets of l.
    For every non-empty subset s of l, output the rule s => (l - s) if sup_count(l) / sup_count(s) >= min_conf.

    Example: l = {I1, I2, I5}, confidence threshold = 70%.
    Non-empty proper subsets: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
    I1 ∧ I2 => I5, confidence = 2/4 = 50%
    I1 ∧ I5 => I2, confidence = 2/2 = 100%
    I2 ∧ I5 => I1, confidence = 2/2 = 100%
    I1 => I2 ∧ I5, confidence = 2/6 = 33%
    I2 => I1 ∧ I5, confidence = 2/7 = 29%
    I5 => I1 ∧ I2, confidence = 2/2 = 100%
    Only the three rules with 100% confidence meet the 70% threshold, so only they are output as strong rules.
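    As a minimal sketch, this rule-generation step can be written directly in Python; the support counts below are hard-coded from the example above rather than computed from a database, and the function name is my own:

    from itertools import combinations

    # Support counts quoted in the example (frequent itemset {I1, I2, I5}).
    support_count = {
        frozenset({'I1', 'I2', 'I5'}): 2,
        frozenset({'I1', 'I2'}): 4,
        frozenset({'I1', 'I5'}): 2,
        frozenset({'I2', 'I5'}): 2,
        frozenset({'I1'}): 6,
        frozenset({'I2'}): 7,
        frozenset({'I5'}): 2,
    }

    def rules_from_itemset(l, min_conf):
        # For every non-empty proper subset s of l, output s => (l - s)
        # when sup_count(l) / sup_count(s) >= min_conf.
        l = frozenset(l)
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(sorted(l), size)):
                conf = support_count[l] / support_count[s]
                verdict = 'strong' if conf >= min_conf else 'rejected'
                print(f"{sorted(s)} => {sorted(l - s)}: {conf:.0%} ({verdict})")

    rules_from_itemset({'I1', 'I2', 'I5'}, min_conf=0.70)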

  • Improving the Efficiency of Apriori

    Hash-based technique.
    Transaction reduction: a transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped from later scans (see the sketch below).
    Partitioning.
    Sampling.
    Dynamic itemset counting: new candidate itemsets are added at "start points" during a scan.
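    A rough illustration of transaction reduction, assuming the frequent k-itemsets of the current pass are already known (the toy data and names below are made up):

    from itertools import combinations

    def reduce_transactions(transactions, frequent_k, k):
        # Keep only transactions containing at least one frequent k-itemset;
        # the rest cannot contribute any frequent (k+1)-itemset and can be
        # skipped in all later scans.
        return [t for t in transactions
                if any(frozenset(c) in frequent_k
                       for c in combinations(sorted(t), k))]

    L2 = {frozenset({'I1', 'I2'}), frozenset({'I2', 'I3'})}     # after pass k = 2
    db = [{'I1', 'I2', 'I3'}, {'I1', 'I2'}, {'I4', 'I5'}]
    print(reduce_transactions(db, L2, k=2))                     # {'I4', 'I5'} is dropped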

  • Hash-Based Technique

  • Partitioning: Scan the Database Only Twice

    Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
    Scan 1: partition the database and find the local frequent patterns of each partition.
    Scan 2: consolidate the global frequent patterns by counting the candidates over the full database.
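    A sketch of the two-scan partitioning idea; mine_local stands in for any in-memory miner such as Apriori and is an assumed parameter, not part of the original slides:

    def partition_mine(db, n_parts, min_sup_ratio, mine_local):
        # Scan 1: mine each partition with the same relative threshold.
        # A globally frequent itemset must be locally frequent in at least
        # one partition, so the union of local results is a complete
        # candidate set (false positives possible, false negatives not).
        size = max(1, (len(db) + n_parts - 1) // n_parts)
        candidates = set()
        for i in range(0, len(db), size):
            candidates |= mine_local(db[i:i + size], min_sup_ratio)
        # Scan 2: count every candidate over the whole database.
        min_count = min_sup_ratio * len(db)
        return {c for c in candidates
                if sum(1 for t in db if c <= t) >= min_count}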

  • Sampling for Frequent Patterns

    Select a sample of the original database and mine the frequent patterns within the sample using Apriori; a lower support threshold can be used so that few frequent itemsets are missed.
    Scan the database once to verify the frequent itemsets found in the sample.
    Scan the database again to find the frequent patterns that the sample missed.
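    A matching sketch of the sampling approach; mine_local is again an assumed stand-in for Apriori, and the second scan for missed patterns is only indicated in a comment:

    import random

    def sample_mine(db, min_sup_ratio, mine_local, sample_frac=0.1):
        # Mine a random sample at a lowered threshold so that itemsets
        # frequent in the full database are unlikely to be missed.
        sample = random.sample(db, max(1, int(sample_frac * len(db))))
        candidates = mine_local(sample, 0.8 * min_sup_ratio)
        # One full scan verifies which candidates are truly frequent.
        min_count = min_sup_ratio * len(db)
        verified = {c for c in candidates
                    if sum(1 for t in db if c <= t) >= min_count}
        # A second full scan (not shown) would look for frequent itemsets
        # that the sample missed entirely.
        return verified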

  • Bottleneck of Frequent-Pattern Mining

    Multiple database scans are costly.
    Mining long patterns needs many passes of scanning and generates lots of candidates.
    To find the frequent itemset i1 i2 ... i100: number of scans = 100; number of candidates = 2^100 - 1, about 1.27 × 10^30.
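    The candidate count quoted above is easy to check in Python:

    n = 2**100 - 1              # all non-empty subsets of a 100-itemset
    print(n)                    # 1267650600228229401496703205375
    print(format(n, '.3e'))     # 1.268e+30, i.e. about 1.27 * 10^30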

    The bottleneck is the candidate-generation-and-test paradigm; the remedy is to avoid candidate generation altogether.

  • Mining Frequent Patterns Without Candidate Generation: FP-Growth

    FP-Growth is a divide-and-conquer technique built on the FP-tree: it grows long patterns from short ones using only the local frequent items.
    FP-tree from a Transaction Database - Example (figure not reproduced here).
    FP-Growth, for each frequent length-1 pattern (the suffix pattern):
    constructs its conditional pattern base, the sub-database consisting of the prefix paths that co-occur with the suffix;
    constructs the conditional FP-tree and mines it recursively;
    generates all combinations of frequent patterns by combining the mined patterns with the suffix.
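    As an assumed illustration (the transactions and the global item order below are not given on the slides; they are only chosen to be consistent with the support counts in the earlier example), the conditional pattern base of the suffix I5 is the set of prefix paths of the transactions containing I5:

    # Two transactions containing I5, items listed in an assumed global
    # support-descending order (I2, I1, I3, I4).
    i5_transactions = [{'I1', 'I2', 'I5'}, {'I1', 'I2', 'I3', 'I5'}]
    order = ['I2', 'I1', 'I3', 'I4']
    for t in i5_transactions:
        prefix_path = [i for i in order if i in t]
        print(prefix_path, ': 1')    # -> ['I2', 'I1'] : 1 and ['I2', 'I1', 'I3'] : 1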

    FP-Growth Algorithm
    Input: a transaction database D and min_sup. Output: the complete set of frequent patterns.
    Construction of the FP-tree:
    Scan the database once, collect the frequent items F, and sort them in descending order of support.
    Create the root of the FP-tree, labeled null.
    For each transaction, sort its frequent items in descending order of support as [p|P] and call Insert_tree([p|P], T):
    if T has a child N with N.item = p, increment N's count;
    otherwise create a new node N with count 1 and set its parent link and node links.
    If P is non-empty, call Insert_tree(P, N) recursively.
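    A compact Python sketch of this construction, assuming transactions are given as sets of item names (the class and function names are my own):

    from collections import Counter

    class FPNode:
        # One FP-tree node: item name, count, parent link, children, and a
        # node link chaining the nodes that carry the same item.
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 1, parent
            self.children, self.node_link = {}, None

    def insert_tree(items, T, header):
        if not items:
            return
        p, P = items[0], items[1:]
        if p in T.children:                 # shared prefix: increment count
            N = T.children[p]
            N.count += 1
        else:                               # new branch with count 1
            N = FPNode(p, T)
            T.children[p] = N
            N.node_link = header.get(p)     # prepend N to p's node-link chain
            header[p] = N
        insert_tree(P, N, header)           # recurse on the remaining items

    def build_fp_tree(db, min_sup):
        counts = Counter(i for t in db for i in t)                  # scan 1
        rank = {i: r for r, i in enumerate(
            sorted((i for i in counts if counts[i] >= min_sup),
                   key=counts.get, reverse=True))}
        root, header = FPNode(None, None), {}
        for t in db:                                                # scan 2
            insert_tree(sorted((i for i in t if i in rank), key=rank.get),
                        root, header)
        return root, header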

  • Algorithm

    Procedure FP_growth(Tree, α):
    if Tree contains a single path P, then for each combination β of the nodes in P, generate the pattern β ∪ α with support = the minimum support count of the nodes in β;
    else, for each item xi in the header table of Tree {
    generate the pattern β = xi ∪ α with support = xi.support;
    construct β's conditional pattern base and then β's conditional FP-tree Tree_β;
    if Tree_β is not empty, call FP_growth(Tree_β, β) }.
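    The sketch below runs the same recursion on a flattened representation: instead of a compressed tree, it passes the conditional pattern base around directly as (prefix-path, count) pairs, which is simpler to show but less memory-efficient than a real conditional FP-tree. The toy database is assumed; it is chosen to be consistent with the support counts quoted in the earlier example.

    from collections import Counter

    def fp_growth(base, min_sup, suffix=frozenset()):
        # base: a conditional pattern base as (itemset, count) pairs.
        counts = Counter()
        for items, n in base:
            for i in items:
                counts[i] += n
        # Visit items in support-ascending order; each frequent pattern is
        # then generated exactly once, suffix-extended as in the procedure.
        freq = [i for i in sorted(counts, key=counts.get) if counts[i] >= min_sup]
        rank = {i: r for r, i in enumerate(freq)}
        for item in freq:
            beta = suffix | {item}
            yield beta, counts[item]
            # beta's conditional pattern base: the prefix paths (strictly
            # more frequent items) of every entry containing `item`.
            cond = [(frozenset(i for i in items if rank.get(i, -1) > rank[item]), n)
                    for items, n in base if item in items]
            cond = [(p, n) for p, n in cond if p]
            if cond:
                yield from fp_growth(cond, min_sup, beta)

    db = [{'I1','I2','I5'}, {'I2','I4'}, {'I2','I3'}, {'I1','I2','I4'},
          {'I1','I3'}, {'I2','I3'}, {'I1','I3'}, {'I1','I2','I3','I5'},
          {'I1','I2','I3'}]
    for pattern, sup in fp_growth([(frozenset(t), 1) for t in db], min_sup=2):
        print(sorted(pattern), sup)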

    Features
    Finds long frequent patterns by recursively looking for shorter ones, then concatenating the suffix.
    Items are kept in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared, so the tree stays compact.
    The FP-tree is main-memory based.
    Efficient and scalable; faster than Apriori.
