-
Chapter 05
Association Rule Mining
Dr. Steffen Herbold
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
-
Outline
• Overview
• The Apriori Algorithm
• Summary
-
Example of Association Rules
(Figure: items already in the basket, the available items, and the item likely to be added.)
-
The General Problem
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• …
Association Rule Mining
Association Rules
• item2 → item3
• item2 → item4
• item2, item3 → item4
• …
Rules describe "interesting relationships"
-
The Formal Problem
• Items: I = {i_1, i_2, …, i_n}
• Transactions: T = {t_1, …, t_m} with t_j ⊆ I
• Rules: X → Y such that X, Y ⊆ I and X ∩ Y = ∅
  • X is also called antecedent or left-hand side
  • Y is also called consequent or right-hand side
How do you get good rules?
-
Defining "Interesting Relationships"
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• …
item2, item3, item4 occur often together
Interesting == often together
-
Outline
• Overview
• The Apriori Algorithm
• Summary
-
Support and Frequent Item Sets
• Support: percentage of occurrences of an itemset I_S
  • support(I_S) = |{t ∈ T : I_S ⊆ t}| / |T|
• Frequent item set: itemsets that appear together "often enough"
  • Defined using a threshold minsupp
  • I_S ⊆ I is frequent if support(I_S) ≥ minsupp
• Rules can be generated by splitting itemsets: I_S = X ∪ Y
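The support computation can be sketched directly from the definition; the transaction list is the chapter's running example, and the helper names are illustrative assumptions:

```python
from typing import List, Set

# Transactions from the running example in this chapter.
TRANSACTIONS: List[Set[str]] = [
    {"item1", "item2", "item3"},
    {"item2", "item4"},
    {"item1", "item5"},
    {"item6", "item7"},
    {"item2", "item3", "item4", "item7"},
    {"item2", "item3", "item4", "item8"},
    {"item2", "item4", "item5"},
    {"item2", "item3", "item4"},
    {"item4", "item5"},
    {"item6", "item7"},
]

def support(itemset: Set[str], transactions: List[Set[str]]) -> float:
    """support(I_S) = |{t in T : I_S subset of t}| / |T|"""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def is_frequent(itemset: Set[str], transactions: List[Set[str]], minsupp: float) -> bool:
    return support(itemset, transactions) >= minsupp

print(support({"item2", "item3", "item4"}, TRANSACTIONS))  # 0.3
```

The `itemset <= t` test is Python's subset check on sets, mirroring I_S ⊆ t in the formula.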
-
Example for Generating Rules
• support({item2, item3, item4}) = 3/10 = 0.3 ≥ 0.3
• Eight possible rules:
  • ∅ → {item2, item3, item4}
  • {item2} → {item3, item4}
  • {item3} → {item2, item4}
  • {item4} → {item2, item3}
  • {item2, item3} → {item4}
  • {item2, item4} → {item3}
  • {item3, item4} → {item2}
  • {item2, item3, item4} → ∅
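Enumerating all splits I_S = X ∪ Y can be sketched as follows (the function name is an illustrative assumption); every subset of the itemset becomes an antecedent, yielding 2^|I_S| rules:

```python
from itertools import combinations

def split_rules(itemset):
    """All rules X -> Y with X ∪ Y = itemset and X ∩ Y = ∅ (2^|I_S| splits)."""
    items = sorted(itemset)
    rules = []
    for r in range(len(items) + 1):
        for antecedent in combinations(items, r):
            x = frozenset(antecedent)
            rules.append((x, frozenset(items) - x))
    return rules

# The eight rules of the example, from empty antecedent to empty consequent.
for x, y in split_rules({"item2", "item3", "item4"}):
    print(sorted(x), "->", sorted(y))
```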
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• item6,item7
minsupp = 0.3
Are all rules interesting?
-
Confidence, Lift, and Leverage
• Confidence: percentage of the transactions containing the antecedent that also contain the consequent
  • confidence(X → Y) = support(X ∪ Y) / support(X) = |{t ∈ T : X ∪ Y ⊆ t}| / |{t ∈ T : X ⊆ t}|
• Lift: ratio of the probability of X and Y together to their probabilities independently
  • lift(X → Y) = support(X ∪ Y) / (support(X) · support(Y))
• Leverage: difference between the probability of X and Y together and their probabilities independently
  • leverage(X → Y) = support(X ∪ Y) − support(X) · support(Y)
Usually, lift favors itemsets with lower support, leverage itemsets with higher support.
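The three measures follow directly from their definitions on top of a support helper; a minimal sketch on the running example (function names are illustrative assumptions):

```python
# Transactions from the running example.
TRANSACTIONS = [
    {"item1", "item2", "item3"}, {"item2", "item4"}, {"item1", "item5"},
    {"item6", "item7"}, {"item2", "item3", "item4", "item7"},
    {"item2", "item3", "item4", "item8"}, {"item2", "item4", "item5"},
    {"item2", "item3", "item4"}, {"item4", "item5"}, {"item6", "item7"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # The empty set is a subset of every transaction, so support(∅) = 1
    # and confidence(∅ -> Y) = support(Y).
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    return support(x | y, transactions) / (support(x, transactions) * support(y, transactions))

def leverage(x, y, transactions):
    return support(x | y, transactions) - support(x, transactions) * support(y, transactions)

x, y = {"item3", "item4"}, {"item2"}
# confidence 1.0, lift ~1.67, leverage ~0.12 for this rule
print(confidence(x, y, TRANSACTIONS), lift(x, y, TRANSACTIONS), leverage(x, y, TRANSACTIONS))
```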
-
Confidence for the Example Rules
• confidence(∅ → {item2, item3, item4}) = 0.3/1 = 0.3
• confidence({item2} → {item3, item4}) = 0.3/0.6 = 0.5
• confidence({item3} → {item2, item4}) = 0.3/0.4 = 0.75
• confidence({item4} → {item2, item3}) = 0.3/0.6 = 0.5
• confidence({item2, item3} → {item4}) = 0.3/0.4 = 0.75
• confidence({item2, item4} → {item3}) = 0.3/0.5 = 0.6
• confidence({item3, item4} → {item2}) = 0.3/0.3 = 1
• confidence({item2, item3, item4} → ∅) = 0.3/0.3 = 1
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• item6,item7
-
Lift for the Example Rules
• lift(∅ → {item2, item3, item4}) = 0.3/(1 · 0.3) = 1
• lift({item2} → {item3, item4}) = 0.3/(0.6 · 0.3) = 1.66
• lift({item3} → {item2, item4}) = 0.3/(0.4 · 0.5) = 1.5
• lift({item4} → {item2, item3}) = 0.3/(0.6 · 0.4) = 1.25
• lift({item2, item3} → {item4}) = 0.3/(0.4 · 0.6) = 1.25
• lift({item2, item4} → {item3}) = 0.3/(0.5 · 0.4) = 1.5
• lift({item3, item4} → {item2}) = 0.3/(0.3 · 0.6) = 1.66
• lift({item2, item3, item4} → ∅) = 0.3/(0.3 · 1) = 1
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• item6,item7
-
Overview of Scores for Example
Rule                           Confidence  Lift  Leverage
∅ → {item2, item3, item4}      0.30        1.00  0.00
{item2} → {item3, item4}       0.50        1.66  0.12
{item3} → {item2, item4}       0.75        1.50  0.10
{item4} → {item2, item3}       0.50        1.25  0.06
{item2, item3} → {item4}       0.75        1.25  0.06
{item2, item4} → {item3}       0.60        1.50  0.10
{item3, item4} → {item2}       1.00        1.66  0.12
{item2, item3, item4} → ∅      1.00        1.00  0.00
{item2, item3, item4} → ∅: perfect confidence, but no gain over randomness.
{item3, item4} → {item2}: perfect confidence, and 1.66 times more likely than randomness.
-
Itemsets and Rules = Exponential
• Number of itemsets is exponential: all possible itemsets form the powerset 𝒫 of I
  • |𝒫(I)| = 2^|I|
• Still exponential if we restrict the size:
  • |I| itemsets with k = 1 items
  • |I| · (|I| − 1) / 2 itemsets with k = 2 items
  • (|I| choose k) = |I|! / ((|I| − k)! · k!) itemsets with k items
• Number of rules per itemset is exponential: possible antecedents of an itemset X form the powerset 𝒫 of X
  • |𝒫(X)| = 2^|X|
• Example: |I| = 100, k = 3
  • 161,700 possible itemsets
  • 1,293,600 possible rules
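The example numbers can be verified with Python's standard library:

```python
from math import comb

n_items, k = 100, 3

# Number of itemsets with exactly k items: |I|! / ((|I| - k)! * k!)
n_itemsets = comb(n_items, k)

# Each itemset of size k admits 2^k rules (every subset can be the antecedent).
n_rules = n_itemsets * 2 ** k

print(n_itemsets, n_rules)  # 161700 1293600
```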
How do we restrict the search space?
-
Pruning the Search Space
• Apriori property: all subsets of frequent itemsets are also frequent
  • support(X′) ≥ support(X) for all X′ ⊆ X, X ⊆ I
• "Grow" itemsets and prune the search space by applying the Apriori property:
  • Start with itemsets of size k = 1
  • Drop all itemsets that do not have minimal support
  • Build all combinations of size k + 1
  • Repeat until no itemsets with minimal support are found, or a threshold for k is reached
• Still has exponential complexity, but better bounded
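The growing loop can be sketched as follows; this is a minimal version without the classic candidate-pruning optimizations, and all names are illustrative assumptions:

```python
def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori(transactions, minsupp):
    """Grow frequent itemsets level by level, dropping candidates below minsupp."""
    items = sorted(set().union(*transactions))
    # k = 1: frequent single items
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsupp]
    frequent = []
    while level:
        frequent.extend(level)
        # Candidates of size k + 1: unions of frequent k-itemsets differing in one item.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c, transactions) >= minsupp]
    return frequent

# Transactions from the running example.
TRANSACTIONS = [
    {"item1", "item2", "item3"}, {"item2", "item4"}, {"item1", "item5"},
    {"item6", "item7"}, {"item2", "item3", "item4", "item7"},
    {"item2", "item3", "item4", "item8"}, {"item2", "item4", "item5"},
    {"item2", "item3", "item4"}, {"item4", "item5"}, {"item6", "item7"},
]

for itemset in apriori(TRANSACTIONS, 0.3):
    print(sorted(itemset))
```

On the running example this reproduces the slides: five frequent single items, then {item2, item3}, {item2, item4}, {item3, item4}, and finally {item2, item3, item4}, after which growing terminates.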
-
Example for Growing Itemsets (k=1)
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• item6,item7
minsupp = 0.3
Itemset   Support
{item1}   0.2   (drop)
{item2}   0.6
{item3}   0.4
{item4}   0.6
{item5}   0.3
{item6}   0.2   (drop)
{item7}   0.3
{item8}   0.1   (drop)
-
Example for Growing Itemsets (k=2)
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• item6,item7
minsupp = 0.3
Itemset          Support
{item2, item3}   0.4
{item2, item4}   0.5
{item2, item5}   0.1   (drop)
{item2, item7}   0.1   (drop)
{item3, item4}   0.3
{item3, item5}   0.0   (drop)
{item3, item7}   0.1   (drop)
…                …
-
Example for Growing Itemsets (k=3)
• Found the following frequent itemsets with at least two items:
  • {item2, item3}, {item2, item4}, {item3, item4}
  • {item2, item3, item4}
Set of Transactions
• item1,item2,item3
• item2,item4
• item1,item5
• item6,item7
• item2,item3,item4,item7
• item2,item3,item4,item8
• item2,item4,item5
• item2,item3,item4
• item4,item5
• item6,item7
minsupp = 0.3
Itemset                 Support
{item2, item3, item4}   0.3
Only itemset remaining, growing terminates.
-
Candidates for Rules
• Usually, not all possible rules are considered
• Two common restrictions:
  • No empty antecedents and consequents: X ≠ ∅ and Y ≠ ∅
  • Only one item as consequent: |Y| = 1
• Example:
∅ → {item2, item3, item4}
{item2} → {item3, item4}
{item3} → {item2, item4}
{item4} → {item2, item3}
{item2, item3} → {item4}
{item2, item4} → {item3}
{item3, item4} → {item2}
{item2, item3, item4} → ∅
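With both restrictions applied, rule generation per frequent itemset shrinks from 2^|I_S| splits to |I_S| candidates; a sketch (the function name is an illustrative assumption):

```python
def candidate_rules(itemset):
    """One rule X -> {y} per item y: non-empty antecedent, single-item consequent."""
    if len(itemset) < 2:
        return []  # a 1-itemset would leave an empty antecedent
    return [(set(itemset) - {y}, {y}) for y in sorted(itemset)]

# For the example itemset, only the three rules with |Y| = 1 remain.
for x, y in candidate_rules({"item2", "item3", "item4"}):
    print(sorted(x), "->", sorted(y))
```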
-
Evaluating Association Rules
• Use different criteria, not just support and confidence
  • Lift and leverage can tell you if rules are coincidental
• Validate whether the rules hold on test data
  • Check if the rules would also be found on the test data
• For basket prediction, use incomplete itemsets on test data
  • Example: remove item4 from all itemsets and see if the rules would correctly predict where it is associated
• Manually inspect rules
  • Do they make sense?
  • Ask domain experts
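The basket-prediction check can be sketched as follows, assuming mined rules are (antecedent, consequent) pairs; the rule list, test data, and all names here are illustrative assumptions:

```python
def hit_rate(rules, test_transactions, held_out="item4"):
    """Remove one item from each relevant test transaction and check whether
    any rule whose antecedent still matches the basket predicts the removed item."""
    relevant = [t for t in test_transactions if held_out in t]
    hits = 0
    for t in relevant:
        basket = t - {held_out}
        # Items predicted by all rules whose antecedent is contained in the basket.
        predicted = {item for x, y in rules if x <= basket for item in y}
        if held_out in predicted:
            hits += 1
    return hits / len(relevant) if relevant else 0.0

# Hypothetical mined rules with single-item consequents, and a small test set.
RULES = [({"item2", "item3"}, {"item4"}), ({"item3", "item4"}, {"item2"})]
TEST = [{"item2", "item3", "item4"}, {"item2", "item4"}, {"item4", "item5"}]
print(hit_rate(RULES, TEST))  # recovers item4 in 1 of 3 transactions
```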
-
Outline
• Overview
• The Apriori Algorithm
• Summary
-
Summary
• Associations are interesting relationships between items
• "Interesting" usually means appearing together and not coincidentally
• The number of possible itemsets/rules is exponential
  • Apriori property for bounding the itemsets
  • Restrictions on rule structures
• Test data and manual inspection for validation