Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling...

41
Data Analytics Beyond OLAP Prof. Yanlei Diao

Transcript of Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling...

Page 1: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Data Analytics Beyond OLAP

Prof. Yanlei Diao

Page 2: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

OPERATIONAL DBs

EXTRACT TRANSFORM

LOAD (ETL)

DATA WAREHOUSE

METADATA STORE

DB 1 DB 2 DB 3

SUPPORTS

OLAP DATA

MINING

INTERACTIVE DATA EXPLORATION

Page 3: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Overview of Topics

q Data Mining in Databases •  Association rule mining

•  Interesting visualizations

q Approximate Query Processing •  Online aggregation: group by aggregation, wander join

•  Interactive SQL

q Interactive Data Exploration •  Faceted search

•  Semantic windows

•  Explore by example

Page 4: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

1. Association Rule Mining

Fast Algorithms for Mining Association Rules

Rakesh Agrawal and Ramakrishnan Srikant VLDB '94

Page 5: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Motivation

•  Example Rules: –  98% of customers who purchase tires get automotive

services done –  Customers who buy mustard and ketchup also buy burgers –  Goal: find these rules from transactional data

•  Rules help with decision making –  E.g., store layout, buying patterns, add-on sales

Page 6: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

DB of "Basket Data" TID items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5

Association Mining

Association Rules {1} => {3} {2,3} => {5} {2,5} => {3}

. . .

Page 7: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

DB of "Basket Data" TID items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5

Associate Rules

Association Rules {1} => {3} {2,3} => {5} {2,5} => {3}

Association rule: X => Y q X and Y are disjoint itemsets, called antecedent (LHS) and consequent (RHS)

•  Confidence: c% of transactions that contain X also contain Y (rule-specific) •  Support: s% of all transactions contain both X and Y (relative to all data)

q Goal: find all rules that satisfy the confidence and support thresholds.

.

.

.

Page 8: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Support Example

TID Cereal Beer Bread Bananas Milk1 X X X2 X X X X3 X X4 X X5 X X6 X X7 X X8 X

Support(Cereal)4/8=.5

Support(Cereal=>Milk)3/8=.375

Page 9: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Confidence Example

TID Cereal Beer Bread Bananas Milk1 X X X2 X X X X3 X X4 X X5 X X6 X X7 X X8 X

Confidence(Cereal=>Milk)3/4=.75

Confidence(Bananas=>Bread)1/3=.33333…

Page 10: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Apriori Algorithm and Notation

•  {i1, i2, …, im} be the set of literals, known as items

•  { Tj } is the set of transactions (database), where each transaction Tj is a set of items s.t.

–  Each transaction has a unique identifier TID

–  The size of an itemset is the number of items

–  Itemset of size k is a k-itemset

•  Assume that items in an itemset are sorted in lexicographical order

Page 11: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

•  Step I: Find all itemsets with minimum support (min_sup s)

•  Step II: Generate rules from min_sup'ed itemsets

General Strategy

support itemsets0.25 {4}, {1,2}, {1,4}, {1,5}, {3,4},

{1,3,4}, {1,2,3}, {1,2,5}, {1,3,5},{1,2,3,5}

0.5 {1}, {1,3}, {2,3}, {3,5}, {2,3,5}0.75 {2}, {3}, {5}, {2,5}

support confidence rules0.5 66% {3}=>{1}, {3}=>{2}, {2}=>{3}, {3}=>{5},

{5}=>{3}, {5}=>{2,3}, {3}=>{2,5},{2}=>{3,5}, {5,2}=>{3}, {5,3}=>{2}

0.5 100% {1}=>{3}, {5,3}=>{2}, {2,3}=>{5}

0.75 100% {5}=>{2}, {2}=>{5}

TID items100 1 3 4200 2 3 5300 1 2 3 5400 2 5

Page 12: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Step I: Finding Minsup Itemsets

•  What is the complexity of finding all subsets of item that satisfy the mini_sup s?

•  The power set of the n literals!

•  A new algorithmic framework based on anti-monotonicity: •  For a frequent itemset, all of its subsets are also frequent •  For an infrequent itemset, all of its supersets must be infrequent •  Can be used to design efficient pruning of the search space.

Page 13: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Anti-monotonicity

•  Anti-monotonicity •  Adding items to an itemset never increases its support •  For a frequent itemset, all of its subsets are also frequent •  For an infrequent itemset, all of its supersets must be infrequent

frequent itemset Lk frequent itemset Lk-1

Page 14: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Anti-monotonicity

•  Anti-monotonicity •  Adding items to an itemset never increases its support •  For a frequent itemset, all of its subsets are also frequent •  For an infrequent itemset, all of its supersets must be infrequent

frequent itemset Lk-1 frequent itemset Lk super-itemset Ck

Page 15: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Anti-monotonicity

•  Anti-monotonicity •  Adding items to an itemset never increases its support •  For a frequent itemset, all of its subsets are also frequent •  For an infrequent itemset, all of its supersets must be infrequent

frequent itemset Lk-1 frequent itemset Lk super-itemset Ck

Page 16: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Step I: Finding Minsup Itemsets

•  Anti-monotonicity:

Adding items to an itemset never increases its support

•  Apriori Algorithm: Proceed inductively on itemset size

1) Base case: Begin with all minsup itemsets of size 1 (L1)

2) Without peeking at the DB, generate candidate itemsets of

size k (Ck) from Lk-1

3) Remove candidate itemsets that contain unsupported subsets

4) Further refine Ck using the database to produce Lk

repe

at

Page 17: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Task 2) Guess Itemsets

•  Naïve way: –  Extend all itemsets with all possible items

•  Apriori: 2)  Join Lk-1 with itself, adding only a single, final item

• e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2, 3, 4} produces {1 2 3 4} and {1 3 4 5}

Page 18: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Task 3) Filter Itemsets

•  Apriori: 2)  Join Lk-1 with itself, adding only a single, final item

• e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2, 3, 4} produces {1 2 3 4} and {1 3 4 5}

3)  Remove itemsets with an unsupported subset • e.g.: {1 3 4 5} has an unsupported subset: {1 4 5} if minsup

= 50%

Page 19: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Task 4) Finalize k-Itemsets

•  Apriori: 2)  Join Lk-1 with itself, adding only a single, final item

• E.g., {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2, 3, 4} produces {1 2 3 4} and {1 3 4 5}

3)  Remove itemsets with an unsupported subset • E.g., {1 3 4 5} has an unsupported subset: {1 4 5} if

minsup = 50% 4)  Use the database to further refine Ck

• Count precisely the occurrence of each itemset in the dataset, to see if it is indeed larger than min_sup

Page 20: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Full Algorithm

22

Page 21: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

23

Page 22: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

2. Interesting Visualizations

SeeDB: efficient data-driven visualization recommendations to support visual analytics

Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, Neoklis Polyzotis VLDB ’14

Page 23: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Visualization Recommendation

Given a dataset and a task, automatically produce a set of visualizations that are

the most “interesting” given the task

Page 24: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

For simplicity, assume a single table

(star schema)

Visualizations = agg. + grp. by queries

Vi = SELECT d, f(m)

FROM table WHERE ___

GROUP BY d

(d, m, f): dimension, measure, aggregate

1.5

2

2.5

3

3.5

4

4.5

2000

20

02

2004

20

06

2008

20

10

2012

50

10 10 30

MA CA IL NY

Space of Visualizations

Page 25: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Space of Visualizations

Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d

(d, m, f): dimension, measure, aggregate

{d} : race, work-type, sex etc. {m} : capital-gain, capital-loss, hours-per-week {f} : COUNT, SUM, AVG

Page 26: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

I.  Interestingness: How do we determine if a visualization is interesting?

–  Utility Metric

II.  Scale: How to make recommendations efficiently and interactively?

–  Optimizations

Key Questions

Page 27: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

A visualization is interesting if it displays a large deviation from some reference

Task: compare unmarried adults with all adults

50

10 10

30

MA CA IL NY

30 20

10

40

MA CA IL NY

V1

V2

V1

V2

Compare induced probability distributions!

Target Reference

V1 = SELECT d, f(m) FROM table WHERE target GROUP BY d ︎V2 = SELECT d, f(m) FROM table WHERE reference GROUP BY d ︎

Deviation-based Utility Metric

Page 28: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

A visualization is interesting if it displays a large deviation from some reference

Many metrics for computing distance between distributions

V1

V2

D [P( V1), P(V2)]

Earth mover’s distance

L1, L2 distance K-L divergence

Any distance metric b/n

distributions is OK!

Deviation-based Utility Metric

Page 29: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Overview of Topics

q Data Mining over Databases •  Association rule mining

•  Interesting visualizations

q Approximate Query Processing •  Online aggregation: group by aggregation, wander join

•  Interactive SQL

q Interactive Data Exploration •  Faceted search

•  Semantic windows

•  Explore by example

Page 30: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

3. Interactive Data Exploration

Explore-by-example: an automatic query steering framework for interactive data exploration

Kyriaki Dimitriadou, Olga Papaemmanouil, Yanlei Diao SIGMOD ’14

Page 31: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Web Applications: e.g., House Hunting

2+ bedrooms, ≤ 0.5 million

2+ bedrooms, price depending on the size

2+ bedrooms, price depending on the size, location

2+ bedrooms, price depending on the size, location, quietness (away from street, higher floor)

2+ bedrooms, price depending on the size, location, quietness, school district, building structure…

Page 32: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

An Explore-by-Example Framework

Query Formulation

Relevant examples

Irrelevant examples User Model

Final Data Extraction

Query

User Model

Relevance Feedback

Sample

Classification Model

Exploration Queries

Space Exploration

Initial d attributes

Sample Acquisition

Page 33: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

(1) Initial Sampling

Analyze sampling methods in this new problem setting: ² Random sampling ² Equi-width stratified sampling ² Equi-depth stratified sampling ² Progressive sampling [Dimitriadou 2014]

How to produce initial examples that fall in the user interest area?

Page 34: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Equi-Width Stratified Sampling

Figure 2: equi-width stratified sampling

bucket according to feature F2, and divide each bucket into c sub-buckets, so the number of points withineach sub-bucket is almost the same. We do this for d dimensions, and we know that after this process, eachgrid will have nearly the same number of data points.

Algorithm 3 Equi-depth Sampling1: procedure SELECT_EQUIDEPTH(k)2: c d

pk

3: Bucket_Set {S}4: for i = 1 to d do5: New_Bucket_Set {}6: for each bucket b in Bucket_Set do7: sort data points in b according to feature Fi

8: divide b into c sub-buckets b1, b2, ..., bc according to Fi, so that |b1| = |b2| = ... = |bc|9: New_Bucket_Set.append(b1, b2, ..., bc)

10: Bucket_Set New_Bucket_Set11: j 112: for each bucket b in Bucket_Set do13: samples[j] one random sample from b14: j j + 1

return samples[1...k]

We illustrate the algorithm of equi-depth stratified sampling through an example in Figure 3. In the example,we divide the data space into 16 grids, and each grid has nearly the same number of data points. Then we

4

Given a database of N data points in D-dim space, a true user interest in a d*-dim subspace •  Start with a user-chosen d-dim subspace, d* ≤ d << D •  Consider generating k samples •  Want to create c buckets per dimension, cd ≥ k •  Generate 1 sample per cell, [ l1 , h1 ] x [ l2 , h2 ] … x [ lc , hc ]

Equi-width stratified sampling §  c buckets are of equal width

Page 35: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Equi-Depth Stratified Sampling

Figure 3: equi-depth stratified sampling

draw one random sample from each grid.

2.4 progressive sampling

Progressive sampling algorithm 4 is to perform equi-width or equi-depth stratified sampling level-by-level.In the first level, we divide each dimension into 2 buckets, so we have 2d grids, and we select a randomsample from each grid. When we go to the second level, we divide each dimension into 22 buckets, sowe have 22d grids. Then we select one random sample from each of these smaller grids. We continue thissampling process until we get one sample from the user interest area.

Algorithm 4 Progressive Sampling1: procedure PROGRESSIVE_SAMPLING2: sample_set = {}3: level 14: while no point in sample_set is within user interest area do5: k 2level⇤d

6: samples SELECT_EQUIWIDTH(k) or SELECT_EQUIDEPTH(k)7: sample_set.append(samples)8: level level + 1

return sample_set

5

Given a database of N data points in D-dim space, a true user interest in a d*-dim subspace •  Start with a user-chosen d-dim subspace, d* <= d << D •  Consider generating k samples •  Want to create c buckets per dimension, cd >= k •  Generate 1 sample per cell, [ l1 , h1 ] x [ l2 , h2 ] … x [ lc , hc ]

Equi-depth stratified sampling §  c buckets have (roughly) equal

numbers of data points

Page 36: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Red

wav

elen

gth

Green Wavelength

Progressive Sampling: dynamic # examples

x

x

x

x

x x

x x

x

x

x

x

x

x

x x

Page 37: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Classification based on Decision Trees

red

red<=14.82 red>14.82

red Irrelevant

Irrelevant green

red<13.55 red>=13.55

green<=13.74

Relevant Irrelevant

green>13.74

SELECT * FROM galaxy WHERE red<= 14.82 AND red>= 13.5 AND

green<=13.74

Sample Red Green UserLabel

ObjectA 13.67 12.34 Yes

ObjectB 15.32 14.50 No

.. .. .. ...

ObjectX 14.21 13.57 Yes

Decision Tree Classifier

Page 38: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

(2) Recover from Misclassified Samples R

ed w

avel

engt

h

Green Wavelength

x

x

x

x

x x

x x

x

x

x

x

x

x

x x

False negative

False positive Predicted Area

False negative

Page 39: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Handling Misclassified Samples R

ed w

avel

engt

h

Green Wavelength

x

x

x

x

x x

x x

x

x

x

x

x

x

x x

√ √

Sampling Areas

√ √ √

√ x √ √

Page 40: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

Clustering-based Sampling

56

Red

wav

elen

gth

Green Wavelength

x √

√ √ √

√ √ √

√ √

Clusters-Sampling

Areas

√ √

√ x x

x √ √ x

x

x

x

Page 41: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide

(3) Refine Boundaries of Relevant Areas

57 57 57

Red

wav

elen

gth

Green Wavelength

Sampling Areas