CS6905 - Data Mining using OLAP
CS6905 Data Mining (by Daniel Lemire)
CS6905 - Data Mining using OLAP
OLAP primarily supports user-driven queries.
However, many data warehouses are used for data mining.
What is the relationship between the two?
Lemire - CS6905ateb ➠ ➡➡ ➠ ■ ✖
Buzz words
✔ Differences between OLAP and Data Mining
✔ OLAP as a Deductive Process
✔ Association Rules
✔ Attribute-Value Focusing
✔ Iceberg Queries
References for this lecture
✔ The course textbook
✔ OLAP Mining: An integration of OLAP with Data Mining, Jiawei Han
✔ Discovery of Multiple-Level Association Rules from Large Databases,
Jiawei Han and Yongjian Fu
✔ Computing Iceberg Queries Efficiently, Min Fang et al.
✔ Many more!
What is data mining?
✔ Fashionable industry term (danger, danger)
✔ Han defines data mining as...
✔ discover some nontrivial and interesting knowledge or patterns
✔ My definition: precise answers to imprecise queries
Is OLAP a form of data mining?
✔ NO.
✔ OLAP is meant for end users
✔ Data Mining is for experts
✔ OLAP provides views
✔ Data Mining provides rules, relations
What does our textbook say?
✔ OLAP provides descriptive modelling
✔ Data Mining provides explanatory modelling
Briefly recall
✔ deduction: from general to specific, applying rules to instances
✔ induction: from specific to general, finding some rules out of many facts
What does the course summary say?
✔ OLAP supports deductive analysis: the user provides a rule, and it is tested and made precise
✔ Data Mining supports inductive analysis: feed in the data source, find a rule
What does Data Mining do?
✔ Characterization: do not breathe ⇔ dead
✔ Comparison: dogs are bigger than cats
✔ Classification: caucasian, asian, african
✔ Association: sunburn when I go outside
✔ Prediction: you are likely to like beer and beautiful
women
An Itemset
An itemset is simply a non-empty set of attribute values.
Typically, itemsets are large.
k-itemsets are itemsets containing exactly k elements
Association Rules
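✔ Formally defined as $A_1 \wedge \dots \wedge A_i \rightarrow B_1 \wedge \dots \wedge B_j$
✔ Support of $A_1 \wedge \dots \wedge A_i$: $P(A_1 \wedge \dots \wedge A_i)$
✔ Confidence of rule $A_1 \wedge \dots \wedge A_i \rightarrow B_1 \wedge \dots \wedge B_j$:
$$P(B_1 \wedge \dots \wedge B_j \mid A_1 \wedge \dots \wedge A_i) = \frac{P(B_1 \wedge \dots \wedge B_j \wedge A_1 \wedge \dots \wedge A_i)}{P(A_1 \wedge \dots \wedge A_i)}$$
(Some authors define support differently)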
Association Rules: Example
Monster Species   Color   Vegetarian?
Ziziz             Red     yes
YiYoz             Blue    yes
Filoufoul         Red     no
Coucou            Red     yes
Passpass          Blue    yes
Association Rules: Example
✔ BLUE monsters are vegetarian: support = 40%, confidence = 100%
✔ RED monsters are vegetarian: support = 40%, confidence = 66.7%
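The figures above can be checked by counting rows. The following is a minimal sketch, assuming that the support of a rule is taken as the joint frequency of its two sides (which is what yields 40% for both rules here); the helper names `support` and `confidence` are illustrative, not from the lecture.

```python
from fractions import Fraction

# The monster table from the example: (species, color, vegetarian?).
MONSTERS = [
    ("Ziziz", "Red", True),
    ("YiYoz", "Blue", True),
    ("Filoufoul", "Red", False),
    ("Coucou", "Red", True),
    ("Passpass", "Blue", True),
]

def is_blue(row):
    return row[1] == "Blue"

def is_red(row):
    return row[1] == "Red"

def is_vegetarian(row):
    return row[2]

def support(rows, *predicates):
    """Fraction of rows satisfying every predicate (a joint frequency)."""
    hits = sum(1 for row in rows if all(p(row) for p in predicates))
    return Fraction(hits, len(rows))

def confidence(rows, antecedent, consequent):
    """Estimate P(consequent | antecedent) by counting rows.

    Assumes at least one row satisfies the antecedent.
    """
    return support(rows, antecedent, consequent) / support(rows, antecedent)

for name, antecedent in (("BLUE", is_blue), ("RED", is_red)):
    s = support(MONSTERS, antecedent, is_vegetarian)
    c = confidence(MONSTERS, antecedent, is_vegetarian)
    print(f"{name} monsters are vegetarian: "
          f"support = {float(s):.0%}, confidence = {float(c):.1%}")
```

With support measured on the antecedent alone, as some authors do, the RED rule would instead have support 60%.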
Strong Association Rules
✔ User sets support and confidence thresholds, e.g. at least 100 relations, 80% confidence
✔ Rules above the support threshold have Large support.
✔ Rules above the confidence threshold have High confidence.
✔ Rules satisfying both are said to be Strong.
Classical Association Rules Mining
✔ The classical reference is Fast Algorithms for Mining Association Rules by Agrawal and Srikant.
✔ They presented the Apriori algorithm: the reference algorithm.
✔ Not very OLAPish though.
Back to OLAP!
✔ To understand, we turn the monster database into a cube (GROUP BY Species)

              red   blue
vegetarian      2      2
eats humans     1      0
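As a rough illustration of how such a count cube could be built, the sketch below tallies the monster table by color and diet with a `Counter`. The variable names are hypothetical, and dividing a cell by the number of rows gives the joint frequency of that cell.

```python
from collections import Counter

# The monster table from the example: (species, color, vegetarian?).
MONSTERS = [
    ("Ziziz", "Red", True),
    ("YiYoz", "Blue", True),
    ("Filoufoul", "Red", False),
    ("Coucou", "Red", True),
    ("Passpass", "Blue", True),
]

# Count cube: one cell per (color, vegetarian?) combination.
# Missing combinations read as 0 because Counter returns 0 for absent keys.
cube = Counter((color, vegetarian) for _, color, vegetarian in MONSTERS)

# Print the cube in the same layout as the slide.
print(f"{'':<12}{'red':>5}{'blue':>6}")
for label, vegetarian in (("vegetarian", True), ("eats humans", False)):
    print(f"{label:<12}{cube[('Red', vegetarian)]:>5}{cube[('Blue', vegetarian)]:>6}")

# Joint frequency of "red and vegetarian", read directly from a cell.
red_vegetarian_frequency = cube[("Red", True)] / len(MONSTERS)
```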
Are cubes always related to support?
✔ Look at these sales... Where's the support?

            Week Days   Week-Ends
hot-dogs      432.32$        132$
fries         332.35$     745.12$

✔ GROUP BY (count) cubes give support for free; other cubes do not!
Ok, so that’s statistics, right?
✔ No.
✔ Statistics samples a problem, uses a model to predict
✔ OLAP does brute force computation
✔ Recall that OLAP wasn't thinkable when statistics became popular
Ok, so it’s simple, right?
✔ No.
✔ Efficient methods to automatically search for strong rules exist∗
✔ They often fail to be useful
✔ Machines don't do pattern recognition well!
(*) Agrawal et al., Mining association rules between sets of items in large databases.
General Association Rule Mining
✔ First, keep only the itemsets with (at least) Minimum Support
✔ Among the itemsets that are left, keep the rules with (at least) Minimum Confidence
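The two phases can be sketched on the monster data as follows. This is a brute-force illustration under the assumption that each monster is a transaction of attribute values; it is not the Apriori algorithm, and the function names are made up for the example.

```python
from itertools import combinations

# Each monster as a set of attribute values (assumed encoding).
TRANSACTIONS = [
    frozenset({"Red", "vegetarian"}),
    frozenset({"Blue", "vegetarian"}),
    frozenset({"Red", "eats humans"}),
    frozenset({"Red", "vegetarian"}),
    frozenset({"Blue", "vegetarian"}),
]

def frequent_itemsets(transactions, min_support):
    """Phase 1: keep every itemset whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                frequent[frozenset(candidate)] = count / n
    return frequent

def strong_rules(frequent, min_confidence):
    """Phase 2: among frequent itemsets, keep rules with enough confidence.

    Only rules with a single item on the right-hand side are generated.
    """
    rules = []
    for itemset, sup in frequent.items():
        for consequent in itemset:
            antecedent = itemset - {consequent}
            if not antecedent:
                continue
            # A subset of a frequent itemset is itself frequent,
            # so its support is already in the dictionary.
            conf = sup / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, consequent, sup, conf))
    return rules

rules = strong_rules(frequent_itemsets(TRANSACTIONS, 0.4), 0.8)
for antecedent, consequent, sup, conf in rules:
    print(f"{set(antecedent)} -> {consequent}: support={sup:.0%}, confidence={conf:.0%}")
```

With these thresholds (40% support, 80% confidence) only the rule Blue → vegetarian survives both phases.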
Sample data
              ok cheese   yellow cheese   skim milk   1% milk   2% milk   fat milk
whole wheat          12               0          43        13        22          0
white bread           8              16         432      3304      4343        444
brown bread           0              32           2        99       441       4324
Simplistic Han-Fu algorithm
✔ Support threshold at 30%; the user wants relations against bread
✔ First test relation cheese vs bread: small support, skip
✔ Test relation milk vs bread: possible!
Finer scale data
              skim milk   1% milk   2% milk   plenty of fat   ...
whole wheat          43        13        22               0   ...
white bread         432      3304      4343             444   ...
brown bread           2        99       441            4324   ...
...                 ...       ...       ...             ...   ...
Attribute-Value Focusing
✔ Support threshold: $P(A_1 \wedge \dots \wedge A_i) > 0.3$
✔ Skim milk vs bread: small support, skip
✔ Test relation 1% milk vs bread: small support (25%)
✔ Test relation 2% milk vs bread: possible!
✔ Test relation fat milk vs bread: possible!
Finer scale Han-Fu
✔ Support threshold at 30% (relative)
✔ Skim milk vs bread: small support, skip
✔ Test relation 1% milk vs bread: small support (25%)
✔ Test relation 2% milk vs bread: possible!
✔ Test relation fat milk vs bread: possible!
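The coarse-then-fine procedure can be sketched on the sample data as below. This is a simplified illustration in the spirit of the slides, not the full Han-Fu algorithm; the `PARENT` hierarchy (cheese versus milk) and the function names are assumptions made for the example.

```python
# Purchase counts from the sample data: bread (rows) versus dairy (columns).
COUNTS = {
    "whole wheat": {"ok cheese": 12, "yellow cheese": 0, "skim milk": 43,
                    "1% milk": 13, "2% milk": 22, "fat milk": 0},
    "white bread": {"ok cheese": 8, "yellow cheese": 16, "skim milk": 432,
                    "1% milk": 3304, "2% milk": 4343, "fat milk": 444},
    "brown bread": {"ok cheese": 0, "yellow cheese": 32, "skim milk": 2,
                    "1% milk": 99, "2% milk": 441, "fat milk": 4324},
}

# Assumed concept hierarchy: fine-grained item -> coarse category.
PARENT = {
    "ok cheese": "cheese", "yellow cheese": "cheese",
    "skim milk": "milk", "1% milk": "milk", "2% milk": "milk", "fat milk": "milk",
}

def item_support(counts, items):
    """Relative support of each item: its column total over the grand total."""
    total = sum(sum(row.values()) for row in counts.values())
    return {i: sum(row[i] for row in counts.values()) / total for i in items}

def roll_up(counts, parent):
    """Aggregate fine-grained columns into their coarse categories."""
    coarse = {}
    for bread, row in counts.items():
        coarse[bread] = {}
        for item, n in row.items():
            coarse[bread][parent[item]] = coarse[bread].get(parent[item], 0) + n
    return coarse

def drill_down(counts, parent, threshold):
    """Test coarse categories first, then only the children of survivors."""
    categories = sorted(set(parent.values()))
    coarse_support = item_support(roll_up(counts, parent), categories)
    kept = [c for c in categories if coarse_support[c] >= threshold]
    children = [i for i in parent if parent[i] in kept]
    fine_support = item_support(counts, children)
    return {i: s for i, s in fine_support.items() if s >= threshold}

survivors = drill_down(COUNTS, PARENT, 0.3)
print(survivors)
```

At a 30% threshold, cheese is dropped at the coarse level, and among the milk types only 2% milk and fat milk remain; 1% milk falls just short at about 25%.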
Conclusion of simplified Han-Fu
✔ Would spot brown bread ↔ plenty of fat
✔ Would spot 2% milk ↔ white bread
whole wheat ↔ skim milk
✔ Whole Wheat has low support
✔ Automated Rule Mining likely to fail
✔ Lowering support will not help!
✔ Confidence/support not enough to define interestingness
✔ Need Human Pattern Recognition!
Item Tuples
✔ To find association rules, it is almost enough to find Item Pairs∗
✔ Item Pairs:
. (white bread, 1% milk) : 3304
. (white bread, 2% milk) : 4343
. (brown bread, fatty milk): 4324
(*) Park et al., An effective hash based algorithm for mining association rules.
Iceberg Cubes
✔ Data Cube $C_{i_1,\dots,i_k}$ (positive values, GROUP-BY)
✔ Choose threshold $\epsilon > 0$.
✔ Iceberg Cube
$$IC_{i_1,\dots,i_k} = \begin{cases} C_{i_1,\dots,i_k} & \text{if } C_{i_1,\dots,i_k} > \epsilon \\ 0 & \text{otherwise} \end{cases}$$
✔ Effectively ignores regions of small support.
Min Fang et al., Computing Iceberg Queries Efficiently.
Beyer and Ramakrishnan, Bottom-Up Computation of Sparse and Iceberg CUBEs.
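The thresholding in this definition is easy to try on the milk and bread counts from the sample data. The sketch below is only an illustration of the definition: the dictionary layout, the function names, and the choice ε = 4000 are assumptions for the example.

```python
# Count cube from the sample data: (bread, milk) -> number of purchases.
CUBE = {
    ("whole wheat", "skim milk"): 43, ("whole wheat", "1% milk"): 13,
    ("whole wheat", "2% milk"): 22, ("whole wheat", "fat milk"): 0,
    ("white bread", "skim milk"): 432, ("white bread", "1% milk"): 3304,
    ("white bread", "2% milk"): 4343, ("white bread", "fat milk"): 444,
    ("brown bread", "skim milk"): 2, ("brown bread", "1% milk"): 99,
    ("brown bread", "2% milk"): 441, ("brown bread", "fat milk"): 4324,
}

def iceberg(cube, epsilon):
    """Keep cells strictly above epsilon and set every other cell to 0."""
    return {cell: (value if value > epsilon else 0) for cell, value in cube.items()}

def sparse(cube):
    """Store only the non-zero cells, which is all an iceberg cube needs."""
    return {cell: value for cell, value in cube.items() if value != 0}

tip = sparse(iceberg(CUBE, 4000))
print(tip)
```

Out of twelve cells, only (white bread, 2% milk) and (brown bread, fat milk) remain above the threshold, which is where the storage savings come from.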
Iceberg Cube (ε = 4000)
              skim milk   1% milk   2% milk   plenty of fat   ...
whole wheat           0         0         0               0   ...
white bread           0         0      4343               0   ...
brown bread           0         0         0            4324   ...
...                 ...       ...       ...             ...   ...
General Iceberg Cubes
✔ Our definition only works for GROUP BY cubes.
✔ More general case is
    SELECT A, B, C, COUNT(*), SUM(X)
    FROM R
    CUBE BY A, B, C
    HAVING COUNT(*) >= N
✔ as opposed to simple iceberg queries
    SELECT A, B, C, COUNT(*)
    FROM R
    GROUP BY A, B, C
    HAVING COUNT(*) >= N
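To see the difference between the two queries, the sketch below runs the simple iceberg query with Python's `sqlite3` module and emulates CUBE BY, which SQLite does not provide, by grouping on every subset of the dimensions. The table contents, the value N = 2, and the helper `iceberg_cube` are made up for illustration.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B TEXT, C TEXT, X INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?)",
    [("a1", "b1", "c1", 10), ("a1", "b1", "c1", 5),
     ("a1", "b2", "c1", 7), ("a2", "b1", "c2", 1)],
)
N = 2

# Simple iceberg query: a single GROUP BY with a HAVING threshold.
simple = conn.execute(
    "SELECT A, B, C, COUNT(*) FROM R GROUP BY A, B, C HAVING COUNT(*) >= ?",
    (N,),
).fetchall()

def iceberg_cube(conn, dims, n):
    """Emulate CUBE BY ... HAVING COUNT(*) >= n, one group-by per subset.

    Rolled-up dimensions are reported as NULL (None in Python).
    """
    results = []
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            select = ", ".join(d if d in subset else "NULL" for d in dims)
            group = f" GROUP BY {', '.join(subset)}" if subset else ""
            sql = (f"SELECT * FROM (SELECT {select}, COUNT(*) AS cnt, "
                   f"SUM(X) AS total FROM R{group}) WHERE cnt >= ?")
            results.extend(conn.execute(sql, (n,)).fetchall())
    return results

cube = iceberg_cube(conn, ("A", "B", "C"), N)
for row in cube:
    print(row)
```

The simple query returns only the fully specified groups, whereas the cube version also reports the aggregates for every coarser grouping that passes the threshold.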
Benefits?
✔ Faster analysis of association rules∗: only need to focus on minimal confidence
✔ Really Sparse Cubes (storage)
(*) Kamber et al., Metarule-guided mining of multi-dimensional association rules using data cubes.