CS6905 - Data Mining using OLAP

Page 1: CS6905 - Data Mining using OLAP

CS6905 Data Mining (by Daniel Lemire)

CS6905 - Data Mining using OLAP

OLAP primarily supports user-driven queries.

However, many data warehouses are used for data mining.

What is the relationship between the two?

Page 2: CS6905 - Data Mining using OLAP

Buzz words

✔ Differences between OLAP and Data Mining

✔ OLAP as a Deductive Process

✔ Association Rules

✔ Attribute-Value Focusing

✔ Iceberg Queries

Page 3: CS6905 - Data Mining using OLAP

References for this lecture

✔ The course textbook

✔ OLAP Mining: An integration of OLAP with Data Mining, Jiawei Han

✔ Discovery of Multiple-Level Association Rules from Large Databases, Jiawei Han and Yongjian Fu

✔ Computing Iceberg Queries Efficiently, Min Fang et al.

✔ Many more!

Page 4: CS6905 - Data Mining using OLAP

What is data mining?

✔ Fashionable industry term (danger, danger)

✔ Han defines data mining as...

✔ discover some nontrivial and interesting knowledge or patterns

✔ My definition: precise answers to imprecise queries

Page 5: CS6905 - Data Mining using OLAP

Is OLAP a form of data mining?

✔ NO.

✔ OLAP is meant for end users

✔ Data Mining is for experts

✔ OLAP provides views

✔ Data Mining provides rules, relations

Page 6: CS6905 - Data Mining using OLAP

What does our textbook say?

✔ OLAP provides descriptive modelling

✔ Data Mining provides explanatory modelling

Page 7: CS6905 - Data Mining using OLAP

Briefly recall

✔ deduction: from general to specific, applying rules to instances

✔ induction: from specific to general, finding some rules out of many facts

Page 8: CS6905 - Data Mining using OLAP

What does the course summary say?

✔ OLAP supports deductive analysis: the user provides a rule, and it is tested and made precise

✔ Data Mining supports inductive analysis: feed in the data source, find a rule

Page 9: CS6905 - Data Mining using OLAP

What does Data Mining do?

✔ Characterization: do not breathe ⇔ dead

✔ Comparison: dogs are bigger than cats

✔ Classification: caucasian, asian, african

✔ Association: sunburn when I go outside

✔ Prediction: you are likely to like beer and beautiful women

Page 10: CS6905 - Data Mining using OLAP

An Itemset

An itemset is simply a non-empty set of attribute values.

Typically, itemsets are large.

k-itemsets are itemsets containing exactly k elements.
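
A minimal sketch of enumerating itemsets, assuming the common convention (as in Agrawal and Srikant) that a k-itemset has exactly k elements. The record and its attribute names (`color`, `diet`, `species`) are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# A hypothetical record, written as a set of attribute=value items.
record = {"color=red", "diet=vegetarian", "species=Ziziz"}


def k_itemsets(items, k):
    """Return every itemset with exactly k elements drawn from `items`."""
    return [frozenset(c) for c in combinations(sorted(items), k)]


# List the 1-, 2- and 3-itemsets contained in the record.
for k in range(1, len(record) + 1):
    print(k, [sorted(s) for s in k_itemsets(record, k)])
```

The number of itemsets grows combinatorially with the number of attributes, which is why mining algorithms prune candidates early.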

Page 11: CS6905 - Data Mining using OLAP

Association Rules

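✔ Formally defined as $A_1 \wedge \dots \wedge A_i \rightarrow B_1 \wedge \dots \wedge B_j$

✔ Support of $A_1 \wedge \dots \wedge A_i$: $P(A_1 \wedge \dots \wedge A_i)$

✔ Confidence of rule $A_1 \wedge \dots \wedge A_i \rightarrow B_1 \wedge \dots \wedge B_j$:

$$P(B_1 \wedge \dots \wedge B_j \mid A_1 \wedge \dots \wedge A_i) = \frac{P(B_1 \wedge \dots \wedge B_j \wedge A_1 \wedge \dots \wedge A_i)}{P(A_1 \wedge \dots \wedge A_i)}$$

(Some authors define support differently)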

Page 12: CS6905 - Data Mining using OLAP

Association Rules: Example

Monster Species   Color   Vegetarian?
Ziziz             Red     yes
YiYoz             Blue    yes
Filoufoul         Red     no
Coucou            Red     yes
Passpass          Blue    yes

Page 13: CS6905 - Data Mining using OLAP

Association Rules: Example

✔ BLUE monsters are vegetarian: support = 40%, confidence = 100%

✔ RED monsters are vegetarian: support = 40%, confidence = 66.7%
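
The figures above can be checked with a short sketch. It assumes the rule's support is the support of the itemset combining both sides (color and vegetarian), which is the reading that reproduces the 40% quoted for both rules. The function name is illustrative.

```python
# Monster table from the example: (species, color, vegetarian?)
monsters = [
    ("Ziziz", "red", True),
    ("YiYoz", "blue", True),
    ("Filoufoul", "red", False),
    ("Coucou", "red", True),
    ("Passpass", "blue", True),
]


def support_and_confidence(rows, color):
    """For the rule 'color -> vegetarian', return
    support = P(color and vegetarian) and
    confidence = P(vegetarian | color)."""
    with_color = [r for r in rows if r[1] == color]
    both = [r for r in with_color if r[2]]
    return len(both) / len(rows), len(both) / len(with_color)


for color in ("blue", "red"):
    s, c = support_and_confidence(monsters, color)
    print(f"{color}: support = {s:.1%}, confidence = {c:.1%}")
```

Two of the five monsters are blue vegetarians and two are red vegetarians, so both rules have the same support; the confidence differs because one of the three red monsters is not vegetarian.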

Page 14: CS6905 - Data Mining using OLAP

Strong Association Rules

✔ User sets support and confidence thresholds, e.g. at least 100 relations, 80% confidence

✔ Rules above support threshold have Large support.

✔ Rules above confidence threshold have High confidence.

✔ Rules satisfying both are said to be Strong.

Page 15: CS6905 - Data Mining using OLAP

Classical Association Rules Mining

✔ The classical reference is Fast Algorithms for Mining Association Rules by Agrawal and Srikant.

✔ They presented the Apriori algorithm: the reference algorithm.

✔ Not very OLAPish though.

Page 16: CS6905 - Data Mining using OLAP

Back to OLAP!

✔ To understand, we turn the monster database into a cube

GROUP BY Species

              red   blue
vegetarian    2     2
eats humans   1     0
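
A minimal sketch of how such a count cube could be built from the monster table. The two dimensions used here (diet and color) are read off the cube's row and column labels; the variable names are illustrative.

```python
from collections import Counter

# Monster table from the example: (species, color, vegetarian?)
monsters = [
    ("Ziziz", "red", True),
    ("YiYoz", "blue", True),
    ("Filoufoul", "red", False),
    ("Coucou", "red", True),
    ("Passpass", "blue", True),
]

# COUNT(*) grouped by (diet, color): each cell is a number of monsters.
cube = Counter(
    ("vegetarian" if veg else "eats humans", color)
    for _species, color, veg in monsters
)

# Dividing a cell by the number of rows turns the count into a support.
total = len(monsters)
for diet in ("vegetarian", "eats humans"):
    for color in ("red", "blue"):
        count = cube[(diet, color)]
        print(f"{diet:12} {color:5} count={count} support={count / total:.0%}")
```

Because every cell is a count, the support of any (diet, color) itemset is just the cell divided by the table size.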

Page 17: CS6905 - Data Mining using OLAP

Are cubes always related to support?

✔ Look at these sales... Where’s the support?

            Week Days   Week-Ends
hot-dogs    432.32$     132$
fries       332.35$     745.12$

✔ GROUP BY cubes give support for free; other cubes do not!

Page 18: CS6905 - Data Mining using OLAP

Ok, so that’s statistics, right?

✔ No.

✔ Statistics samples a problem, uses a model to predict

✔ OLAP does brute force computation

✔ Recall that OLAP wasn’t thinkable when statistics became popular

Page 19: CS6905 - Data Mining using OLAP

Ok, so it’s simple, right?

✔ No.

✔ Efficient methods to automatically search for strong rules exist (*)

✔ They often fail to be useful

✔ Machines don’t do pattern recognition well!

(*) Agrawal et al., Mining association rules between sets of items in large databases.

Page 20: CS6905 - Data Mining using OLAP

General Association Rule Mining

✔ First, focus on getting (at least) Minimum Support

✔ Then, among the candidates that remain, keep those with (at least) Minimum Confidence
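
The two phases can be sketched on the monster table as follows. This is an illustrative sketch rather than any published algorithm: it only considers rules of the form color → diet, and the thresholds in the usage line are hypothetical.

```python
from collections import Counter

# Monster table from the earlier example: (species, color, vegetarian?)
monsters = [
    ("Ziziz", "red", True),
    ("YiYoz", "blue", True),
    ("Filoufoul", "red", False),
    ("Coucou", "red", True),
    ("Passpass", "blue", True),
]


def mine_rules(rows, min_support, min_confidence):
    """Return (color, vegetarian, support, confidence) for strong rules."""
    n = len(rows)
    pair_counts = Counter((color, veg) for _s, color, veg in rows)
    color_counts = Counter(color for _s, color, _v in rows)
    # Phase 1: keep only (color, diet) pairs with enough support.
    frequent = {p: c for p, c in pair_counts.items() if c / n >= min_support}
    # Phase 2: among the survivors, keep rules with enough confidence.
    rules = []
    for (color, veg), count in frequent.items():
        confidence = count / color_counts[color]
        if confidence >= min_confidence:
            rules.append((color, veg, count / n, confidence))
    return rules


# Hypothetical thresholds: 30% support, 80% confidence.
print(mine_rules(monsters, 0.3, 0.8))
```

With these thresholds, the red → vegetarian rule passes the support phase but is dropped in the confidence phase, so only blue → vegetarian remains.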

Page 21: CS6905 - Data Mining using OLAP

Sample data

              ok cheese   yellow cheese   skim milk   1% milk   2% milk   fat milk
whole wheat   12          0               43          13        22        0
white bread   8           16              432         3304      4343      444
brown bread   0           32              2           99        441       4324

Page 22: CS6905 - Data Mining using OLAP

Simplistic Han-Fu algorithm

✔ Support threshold at 30%, user wants relation against bread

✔ First test relation cheese vs bread: small support, skip

✔ Test relation milk vs bread: possible!
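
A rough sketch of this coarse-level test on the sample data. It assumes that support means the share of all counted purchases falling in a category, and that each product rolls up to "cheese" or "milk" by its name; both are assumptions made for illustration.

```python
# Sample data: rows are bread types, columns are products.
columns = ["ok cheese", "yellow cheese", "skim milk",
           "1% milk", "2% milk", "fat milk"]
counts = {
    "whole wheat": [12, 0, 43, 13, 22, 0],
    "white bread": [8, 16, 432, 3304, 4343, 444],
    "brown bread": [0, 32, 2, 99, 441, 4324],
}
# Hypothetical roll-up of each product to its category.
category = {c: ("cheese" if "cheese" in c else "milk") for c in columns}


def category_support(counts, columns, category):
    """Share of the grand total that falls in each product category."""
    total = sum(sum(row) for row in counts.values())
    totals = {}
    for row in counts.values():
        for col, value in zip(columns, row):
            totals[category[col]] = totals.get(category[col], 0) + value
    return {cat: t / total for cat, t in totals.items()}


THRESHOLD = 0.30
for cat, s in category_support(counts, columns, category).items():
    verdict = "possible" if s > THRESHOLD else "skip"
    print(f"{cat} vs bread: support = {s:.1%} -> {verdict}")
```

Cheese accounts for well under 1% of the counts, so only the milk category is worth refining further.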

Page 23: CS6905 - Data Mining using OLAP

Finer scale data

              skim milk   1% milk   2% milk   plenty of fat   · · ·
whole wheat   43          13        22        0               · · ·
white bread   432         3304      4343      444             · · ·
brown bread   2           99        441       4324            · · ·
...           ...         ...       ...       ...             . . .

Page 24: CS6905 - Data Mining using OLAP

Attribute-Value Focusing

✔ Support threshold: $P(A_1 \wedge \dots \wedge A_i) > 0.3$

✔ Skim milk vs bread: small support, skip

✔ Test relation 1% milk vs bread: small support (25%)

✔ Test relation 2% milk vs bread: possible!

✔ Test relation fat milk vs bread: possible!

Page 25: CS6905 - Data Mining using OLAP

Finer scale Han-Fu

✔ Support threshold at 30% (relative)

✔ Skim milk vs bread: small support, skip

✔ Test relation 1% milk vs bread: small support (25%)

✔ Test relation 2% milk vs bread: possible!

✔ Test relation fat milk vs bread: possible!
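
The finer-scale test can be sketched the same way. This sketch assumes "relative" means relative to the milk total, the category that survived the coarse test; that reading gives about 25% for 1% milk, in line with the figure quoted above.

```python
# Finer scale data: milk types only, one row per bread type.
milk_types = ["skim milk", "1% milk", "2% milk", "fat milk"]
counts = {
    "whole wheat": [43, 13, 22, 0],
    "white bread": [432, 3304, 4343, 444],
    "brown bread": [2, 99, 441, 4324],
}
THRESHOLD = 0.30

# Support of each milk type, relative to all milk purchases.
total = sum(sum(row) for row in counts.values())
column_totals = [sum(col) for col in zip(*counts.values())]
support = {m: t / total for m, t in zip(milk_types, column_totals)}

# Only milk types above the threshold are tested against bread.
kept = [m for m in milk_types if support[m] > THRESHOLD]

for m in milk_types:
    verdict = "possible" if m in kept else "skip"
    print(f"{m}: support = {support[m]:.1%} -> {verdict}")
```

Skim milk and 1% milk fall below the threshold, so only 2% milk and fat milk are examined further.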

Page 26: CS6905 - Data Mining using OLAP

Conclusion of simplified Han-Fu

✔ Would spot brown bread ↔ plenty of fat

✔ Would spot 2% milk ↔ white bread

Page 27: CS6905 - Data Mining using OLAP

whole wheat ↔ skim milk

✔ Whole Wheat has low support

✔ Automated Rule Mining likely to fail

✔ Lowering support will not help!

✔ Confidence/support not enough to define interestingness

✔ Need Human Pattern Recognition!

Page 28: CS6905 - Data Mining using OLAP

Item Tuples

✔ To find association rules, almost enough to find Item Pairs (*)

✔ Item Pairs:

. (white bread, 1% milk): 3304

. (white bread, 2% milk): 4343

. (brown bread, fatty milk): 4324

(*) Park et al., An effective hash based algorithm for mining association rules.
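
As a sketch, the item pairs can be read straight off the finer-scale table by ranking its cells. Keeping the three largest cells is an illustrative choice; the slide does not say how its three pairs were selected.

```python
# Finer scale data: milk types only, one row per bread type.
milk_types = ["skim milk", "1% milk", "2% milk", "fat milk"]
counts = {
    "whole wheat": [43, 13, 22, 0],
    "white bread": [432, 3304, 4343, 444],
    "brown bread": [2, 99, 441, 4324],
}

# Flatten the table into ((bread, milk), count) item pairs.
pairs = [
    ((bread, milk), n)
    for bread, row in counts.items()
    for milk, n in zip(milk_types, row)
]

# Rank the pairs by count, largest first, and keep the top three.
pairs.sort(key=lambda item: item[1], reverse=True)
top_pairs = pairs[:3]

for (bread, milk), n in top_pairs:
    print(f"({bread}, {milk}): {n}")
```

The fourth-largest cell is 444, an order of magnitude below the third, so the three leading pairs stand out clearly.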

Page 29: CS6905 - Data Mining using OLAP

Iceberg Cubes

✔ Data Cube $C_{i_1,\dots,i_k}$ (positive values, GROUP-BY)

✔ Choose threshold $\epsilon > 0$.

✔ Iceberg Cube:

$$IC_{i_1,\dots,i_k} = \begin{cases} C_{i_1,\dots,i_k} & \text{if } C_{i_1,\dots,i_k} > \epsilon \\ 0 & \text{otherwise} \end{cases}$$

✔ Effectively ignores regions of small support.

Min Fang et al., Computing Iceberg Queries Efficiently.

Beyer and Ramakrishnan, Bottom-Up Computation of Sparse and Iceberg CUBEs.

Page 30: CS6905 - Data Mining using OLAP

Iceberg Cube ( ε = 4000)

              skim milk   1% milk   2% milk   plenty of fat   · · ·
whole wheat   0           0         0         0               · · ·
white bread   0           0         4343      0               · · ·
brown bread   0           0         0         4324            · · ·
...           ...         ...       ...       ...             . . .
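
A minimal sketch of the thresholding step, applying the iceberg definition (keep a cell only if it is strictly greater than ε) to the visible part of the finer-scale table. The function name and the dictionary layout are illustrative.

```python
EPSILON = 4000

# Visible part of the finer-scale cube: one row per bread type,
# columns are skim milk, 1% milk, 2% milk, plenty of fat.
cube = {
    "whole wheat": [43, 13, 22, 0],
    "white bread": [432, 3304, 4343, 444],
    "brown bread": [2, 99, 441, 4324],
}


def iceberg(cube, epsilon):
    """Keep cells strictly above epsilon; set every other cell to 0."""
    return {
        row: [v if v > epsilon else 0 for v in values]
        for row, values in cube.items()
    }


ice = iceberg(cube, EPSILON)
# Count the cells that survive, to show how sparse the result is.
nonzero = sum(1 for values in ice.values() for v in values if v != 0)

for row, values in ice.items():
    print(f"{row:12}", values)
print("non-zero cells:", nonzero)
```

Only the cells holding 4343 and 4324 exceed 4000, so a sparse representation needs to store just two values for this part of the cube.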

Page 31: CS6905 - Data Mining using OLAP

General Iceberg Cubes

✔ Our definition only works for GROUP BY cubes.

✔ More general case is

SELECT A,B,C,COUNT(*),SUM(X) FROM R CUBE BY A,B,C HAVING COUNT(*) >= N

✔ as opposed to simple iceberg queries

SELECT A,B,C,COUNT(*) FROM R GROUP BY A,B,C HAVING COUNT(*) >= N
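
The simple iceberg query can be tried with Python's built-in `sqlite3` module. This is a sketch only: the rows of `R` are hypothetical, `N` is set to 2 for illustration, and an `ORDER BY` is added so the output order is deterministic. The CUBE BY form is not attempted here.

```python
import sqlite3

# In-memory database holding a hypothetical relation R(A, B, C).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B TEXT, C TEXT)")
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c1"),
    ("a1", "b1", "c1"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c2"),
    ("a2", "b1", "c2"),
]
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

N = 2  # hypothetical minimum count

# Simple iceberg query: keep only the groups with at least N rows.
result = conn.execute(
    "SELECT A, B, C, COUNT(*) FROM R "
    "GROUP BY A, B, C HAVING COUNT(*) >= ? "
    "ORDER BY A, B, C",
    (N,),
).fetchall()
conn.close()

print(result)
```

The group (a1, b2, c1) has a single row and is filtered out by the HAVING clause, which is the "below the waterline" part of the iceberg.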

Page 32: CS6905 - Data Mining using OLAP

Benefits?

✔ Faster analysis of association rules (*): only need to focus on minimal confidence

✔ Really Sparse Cubes (storage)

(*) Kamber et al., Metarule-guided mining of multi-dimensional association rules using data cubes.
