An optimal and progressive algorithm for skyline queries slide

Post on 15-Jul-2015


INI Lab.

An Optimal and Progressive Algorithm for Skyline Queries

Dimitris Papadias, Yufei Tao, Greg Fu, Bernhard Seeger

ACM SIGMOD 2003

Presenters: KYEONG SEOK HYUN, WOO-SUNG CHOI, JA-YEON KIM

Abstract

An Optimal and Progressive Algorithm for Skyline Queries Using R-Tree

Contents

1. Introduction

2. Related Work

2.1 Block Nested Loop (BNL)

2.5 Nearest Neighbor (NN)

3. Branch and Bound Skyline Algorithm

With I/O analysis

5. Experimental Evaluation

Skyline

Problem definition

Which one do you prefer?

http://www.huffingtonpost.kr/2014/11/13/story_n_6150254.html

http://drmoontv.blogspot.kr/2013/03/blog-post_17.html

http://emperia.egloos.com/m/2516211

5,000 Won

40,000 Won

4,500 Won

http://flickrhivemind.net/User/Trollface%20T-Shirts/Interesting

Good value ("Hyeja") >> rip-off ("Changryeol")

Preliminaries

Formal definition of Dominates (≪)

Given a set of d-dimensional points $T$

We say that a point $t_1 \in T$ DOMINATES another point $t_2 \in T$

if and only if

$\forall i \in \{1, 2, 3, \dots, d\}: t_1[i] \ge t_2[i]$

$\exists j \in \{1, 2, 3, \dots, d\}: t_1[j] > t_2[j]$

This is denoted by $t_2 \ll t_1$

(Simply put, t1 is the better deal.)

Definition from http://www.comp.nus.edu.sg/~atung/publication/k_dominant.pdf

Note that the meaning of ‘dominates’ may differ depending on the type of application.

Which one do you prefer?

http://www.huffingtonpost.kr/2014/11/13/story_n_6150254.html

http://drmoontv.blogspot.kr/2013/03/blog-post_17.html

http://emperia.egloos.com/m/2516211

5,000 Won

40,000 Won

4,500 Won

4,500 Won

http://flickrhivemind.net/User/Trollface%20T-Shirts/Interesting

Still: good value ("Hyeja") >> rip-off ("Changryeol")

Hotel(attraction, 1/price, 1/distance)

Two hotels

A : `80`, `1/15,000`, `1/500m`

B : `30`, `1/20,000`, `1/1500m`

𝐵 ≪ 𝐴

Why?

30<80

1/20,000 < 1/15,000

1/1,500m < 1/500m
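The dominance test above can be sketched in a few lines of Python. This is only an illustration under the slide's convention that larger values are better in every dimension; the function name `dominates` and the tuple encoding of the two hotels are assumptions made for this example.

```python
from typing import Sequence


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """Return True if t1 dominates t2 (t2 << t1), with larger values preferred."""
    if len(t1) != len(t2):
        raise ValueError("points must have the same dimensionality")
    at_least_as_good = all(a >= b for a, b in zip(t1, t2))
    strictly_better = any(a > b for a, b in zip(t1, t2))
    return at_least_as_good and strictly_better


# Hotel(attraction, 1/price, 1/distance), values taken from the slide.
hotel_a = (80, 1 / 15_000, 1 / 500)
hotel_b = (30, 1 / 20_000, 1 / 1_500)

print(dominates(hotel_a, hotel_b))  # True: A is better in all three dimensions
print(dominates(hotel_b, hotel_a))  # False
```

A point never dominates itself under this definition, because the strict inequality fails in every dimension.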

[Figure: for example, hotels A and B plotted on the attraction and 1/price axes. A dominates B. Callout: "Very important".]

Problem Definition (mathematical)

The Skyline operator

Input: a set of objects $P = \{p_1, p_2, \dots, p_N\}$

Output: $\{\, p_i \mid p_i \in P \text{ and } \nexists\, p^* \in P \text{ s.t. } p_i \ll p^* \,\}$

[Figure: points A to G on the x and y axes, with Dominating Area(B) marked.]

Common misconceptions

“B ∈ Output, since B ≫ C, D, F”: wrong

“B ∈ Output, since no other point in P ≫ B”: correct

Naïve approach

for processing skyline queries

Exhaustive Test

Suppose there are n objects in the given set

D = {o_1, o_2, …, o_n}

Algorithm - Naïve 1

S = ∅
for each object o_x ∈ D
    boolean isDominated = false
    for each object o_y ∈ D
        if o_x ≠ o_y AND o_x ≪ o_y then
            isDominated = true
            break
    if !isDominated then S = S ∪ {o_x}
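The exhaustive test can be sketched as runnable Python. This is a minimal illustration, assuming larger values are preferred (as in the dominance definition); the sample coordinates are hypothetical and are not the points A to G from the figure.

```python
def dominates(p, q):
    """True if p dominates q, with larger values preferred in every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def naive_skyline(points):
    """Compare every object against every other object: O(n^2) dominance tests."""
    skyline = []
    for i, ox in enumerate(points):
        is_dominated = False
        for j, oy in enumerate(points):
            # Skip the object itself; stop at the first object that dominates ox.
            if i != j and dominates(oy, ox):
                is_dominated = True
                break
        if not is_dominated:
            skyline.append(ox)
    return skyline


# Hypothetical 2-d points, for illustration only.
points = [(1, 9), (3, 7), (2, 6), (5, 5), (4, 2), (7, 3), (6, 1)]
print(naive_skyline(points))  # [(1, 9), (3, 7), (5, 5), (7, 3)]
```

The self-comparison is skipped by index rather than by value, so exact duplicates of a skyline point are all reported.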


Nested Loop Structure

Modification: (Algorithm - Naïve 2)

Idea 1. Use a nested loop structure

Idea 2. Take advantage of ‘block transfer’ for better reusability!

[Figure: points A to G stored in two blocks, Block A and Block B.]
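The two ideas can be sketched as a block-at-a-time loop that keeps a window of candidate skyline points in memory. This is a simplified illustration in the spirit of Block Nested Loop, not the slides' exact algorithm: it assumes the window always fits in memory, that larger values are preferred, and that `read_blocks` stands in for real block transfer from disk. All names and coordinates are hypothetical.

```python
from itertools import islice


def dominates(p, q):
    """True if p dominates q, with larger values preferred in every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def read_blocks(points, block_size):
    """Yield the data one block at a time (a stand-in for disk block transfer)."""
    it = iter(points)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def block_nested_loop_skyline(points, block_size=3):
    window = []  # candidate skyline points kept in memory
    for block in read_blocks(points, block_size):
        for p in block:
            if any(dominates(w, p) for w in window):
                continue  # p is dominated by a candidate: discard it
            # p survives: drop the candidates it dominates, then keep p.
            window = [w for w in window if not dominates(p, w)]
            window.append(p)
    return window


points = [(1, 9), (3, 7), (2, 6), (5, 5), (4, 2), (7, 3), (6, 1)]
print(block_nested_loop_skyline(points))  # [(1, 9), (3, 7), (5, 5), (7, 3)]
```

Each point is compared only against the current candidates rather than the whole dataset, but the data is still scanned in full.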

The inherent limitation of these approaches

1. They need a full scan over the data

2. Yet the query result contains only a small fraction of the dataset

3. That is, these approaches are wasteful

R-Tree Index Approach

for processing skyline queries

Preliminaries

R-Tree

Nearest Neighbor Query

Preliminaries

R-Tree: a balanced tree for indexing multi-dimensional objects

Supports dynamic operations (insert, update, delete)

R-Tree Index Approach

R-Tree vs. B-Tree

B+-Tree: balanced (all leaves are required to be at the same depth); leaf nodes contain one-dimensional values

R-Tree: similar to the B+-Tree; leaf nodes contain d-dimensional values

http://courses.cs.washington.edu/courses/cse444/09sp/hw/hw3/hw3.html

R-Tree Index Approach

Spatial objects (or d-dimensional objects, or geometric objects)

d-dimensional object? The R-Tree is used to organize a set of d-dimensional objects

How? Main idea: Minimum Bounding Rectangles (MBRs)

http://caversham.otago.ac.nz/research/geog.php

<Objects in 2-dimension space>

Quiz

What is the minimum number of points needed to represent a rectangle?

Assumption: each rectangle is parallel to the coordinate axes

[Figure: an axis-parallel rectangle drawn on the x and y axes.]
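Under the axis-parallel assumption, two opposite corners are enough to describe a rectangle, which is how an MBR is commonly stored. The sketch below computes such a pair of corners for a set of points; the function name `mbr` and the sample coordinates are hypothetical.

```python
def mbr(points):
    """Return the (lower, upper) corners of the minimum bounding rectangle."""
    dims = range(len(points[0]))
    lower = tuple(min(p[i] for p in points) for i in dims)
    upper = tuple(max(p[i] for p in points) for i in dims)
    return lower, upper


# Three hypothetical 2-d points; the same code works for any dimensionality.
print(mbr([(1, 4), (6, 7), (8, 4)]))  # ((1, 4), (8, 7))
```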

R-Tree Index Approach

Demonstration

R-Tree Simulator

Nearest Neighbor (NN) Query Processing using R-Tree

Nearest Neighbor Query

Input: a set of objects $P = \{p_1, p_2, \dots, p_N\}$ and a query point $q$

Output: $\{\, p_i \mid p_i \in P \text{ and } \nexists\, p^* \in P \text{ s.t. } L_p(p_i, q) > L_p(p^*, q) \,\}$

See how it works in appendix

R-Tree Index Approach

[Figure: query point X on the x and y axes and the root node with entries 0 and 1; labels MINDIST(X, 0), MINDIST(X, 1), MINMAXDIST(X, 0), MINMAXDIST(X, 1).]

Key IDEA! Pruning!
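The pruning idea can be illustrated with MINDIST alone: if even the closest possible point of an MBR is farther away than the best neighbor found so far, the whole subtree can be skipped. The sketch below assumes Euclidean distance and MBRs stored as (lower, upper) corners. The helper `can_prune` and the sample rectangles are hypothetical, and MINMAXDIST is not covered here.

```python
import math


def mindist(q, lower, upper):
    """Smallest Euclidean distance from point q to the box [lower, upper]."""
    total = 0.0
    for qi, lo, hi in zip(q, lower, upper):
        gap = max(lo - qi, 0.0, qi - hi)  # 0 when qi lies inside [lo, hi]
        total += gap * gap
    return math.sqrt(total)


def can_prune(q, lower, upper, best_so_far):
    """An MBR cannot contain a closer neighbor if its MINDIST exceeds the best distance."""
    return mindist(q, lower, upper) > best_so_far


q = (0.0, 0.0)
print(mindist(q, (3.0, 4.0), (6.0, 8.0)))    # 5.0
print(mindist(q, (-1.0, -1.0), (1.0, 1.0)))  # 0.0 (q is inside the box)
print(can_prune(q, (3.0, 4.0), (6.0, 8.0), best_so_far=2.5))  # True
```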

http://www.installitdirect.com/blog/easy-tips-for-pruning-your-plants/

http://ko.aliexpress.com/store/category/pruning-tools/519349_100005637.html

http://www.davey.com/

Back to the original question

Skyline with R-Tree

R-Tree Index Approach

Let’s process skyline objects using R-Tree

Strategy 1 – Use a traditional technique (i.e., NN query)

Strategy 2 – This paper

Strategy 1

Partition the data using NN Query recursively

Distance metric: 𝐿1 𝑛𝑜𝑟𝑚

First NN Query -> start from the ideal point (i.e. zero point)
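The recursive partitioning can be sketched for the 2-d case. This is an illustration under several assumptions: smaller values are preferred (the ideal point is the origin), the NN query is answered by a brute-force scan instead of an R-Tree, and each to-do region is encoded by exclusive upper bounds on x and y. The function names and the sample coordinates are hypothetical.

```python
def l1(p):
    """L1 distance from the ideal point (the origin) for non-negative coordinates."""
    return sum(p)


def nn_skyline_2d(points):
    skyline = []
    # Each to-do entry is a region {(x, y): x < bx and y < by}.
    todo = [(float("inf"), float("inf"))]
    while todo:
        bx, by = todo.pop()
        inside = [p for p in points if p[0] < bx and p[1] < by]
        if not inside:
            continue  # empty query: nothing left in this region
        nn = min(inside, key=l1)  # brute-force stand-in for an R-Tree NN query
        skyline.append(nn)
        # The area dominated by nn is dropped; two to-do regions remain.
        todo.append((nn[0], by))  # points with smaller x than nn
        todo.append((bx, nn[1]))  # points with smaller y than nn
    return skyline


# Hypothetical 2-d points, for illustration only.
points = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5),
          (5, 6), (4, 3), (9, 1), (8, 3), (10, 4)]
print(nn_skyline_2d(points))  # [(4, 3), (9, 1), (1, 9)]
```

In two dimensions the two to-do regions overlap only where no point can exist, so no point is reported twice; this sketch does not handle higher dimensions.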

Strategy 1

Recursive NN Query

[Figure (example): points a, b, c, d, e, f, g, i, k, m, n on the x and y axes, with IDEAL marked; point i defines Dominating Area(i), To-do Area 1 and To-do Area 2.]

[Figure (example): To-do Area 1 and To-do Area 2 around Dominating Area(i); points b, i, k; IDEAL.]

[Figure (example): Dominating Area(i) and Dominating Area(k) with TO-DO Area 1 and TO-DO Area 2; points b, i, k; IDEAL.]

Next, test these areas (only to find nothing)

[Figure (example): To-do Area 1 with points i, k, a, b; Dominating Area(i), Dominating Area(k), Dominating Area(a); IDEAL.]

Result

[Figure (result): points i, k, a on the x and y axes with Dominating Area(i), Dominating Area(k), Dominating Area(a); IDEAL.]

Limitation of Strategy 1

Generally speaking, in a d-dimensional space, each skyline object discovered causes d recursive partitioning phases.


[Figure: Area 1, Area 2 and Area 3 around the dominated region.]

What if?

In general, for d > 2, the overlap of the partitions necessitates DUPLICATE ELIMINATION.

[Figure: overlapping partitions Area 1, Area 2 and Area 3 around the dominated region.]

Disadvantage!

Strategy 1 needs an additional phase for removing redundant outputs

4 elimination methods

Laisser-faire

Propagate

Merge

Fine-grained Partitioning

They work

Problem: sub-optimal

Strategy 2

Branch & Bound Skyline Algorithm

Idea!

Similar to previous NN Query

Branch & Bound Skyline (BBS)

http://greatleadersserve.org/leadership/big-idea-great-leaders-serve/

[Figure (example): points a, b, c, d, e, f, g, h, i, k, l, m, n on the x and y axes, with IDEAL marked; MBRs L1E1, L1E2 and L2E1 to L2E4.]

R-Tree nodes:

Root: Ptr 1, Ptr 2, Ptr 3, Ptr 4
L1E1: Ptr 1, Ptr 2, Ptr 3, Ptr 4
L1E2: Ptr 1, Ptr 2, Ptr 3, Ptr 4
L2E1: a, b, c, null
L2E2: c, h, i, null
L2E3: d, g, m, null
L2E4: f, k, l, n

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide. Entries shown: L1E1, L1E2.]

Queue: (L1E2, 4), (L1E1, 10)

Result: (empty)

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide. Entry shown: (L1E2, 4).]

Queue: (L2E2, 5), (L2E3, 7), (L2E4, 8), (L1E1, 10)

Result: (empty)

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide. Entries shown: (c, 12), (h, 7), (i, 5).]

Queue: (L2E2, 5), (L2E3, 7), (L2E4, 8), (L1E1, 10)

Result: (empty)

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide.]

Queue: (i, 5), (h, 7), (L2E3, 7), (L2E4, 8), (L1E1, 10), (c, 12)

Result: (empty)

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide.]

Queue: (h, 7), (L2E3, 7), (L2E4, 8), (L1E1, 10), (c, 12)

Result: (i, 5)

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide. Entries shown: (k, 10), f, n, i.]

Queue: (L2E4, 8), (L1E1, 10), (c, 12)

Result: (i, 5)

[Figure (example): same points, MBRs and R-Tree nodes as in the first example slide.]

Queue: (no entries shown)

Result: (i, 5), (a, 10), (k, 10)
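The walkthrough above can be approximated by a short best-first sketch. It is an illustration under assumptions, not the paper's implementation: smaller values are preferred (the ideal point is the origin), the tree is a hand-built in-memory structure rather than a real R-Tree, each node stores only the lower-left corner of its MBR (which is all this sketch needs), and the coordinates are hypothetical because the slide's coordinates are not recoverable.

```python
import heapq
from itertools import count


def dominates(p, q):
    """Smaller is better here: p <= q in every dimension and p < q in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


class Node:
    """Hypothetical in-memory tree node holding child nodes and/or data points."""

    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        corners = [c.lower for c in self.children] + self.points
        dims = range(len(corners[0]))
        # Lower-left corner of the MBR: used for both mindist and dominance tests.
        self.lower = tuple(min(c[i] for c in corners) for i in dims)


def corner_of(entry):
    return entry.lower if isinstance(entry, Node) else entry


def bbs_skyline(root):
    skyline = []
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(sum(root.lower), next(tie), root)]
    while heap:
        _, _, entry = heapq.heappop(heap)
        if any(dominates(s, corner_of(entry)) for s in skyline):
            continue  # pruned: dominated by a skyline point found earlier
        if isinstance(entry, Node):
            for child in entry.children + entry.points:
                c = corner_of(child)
                if not any(dominates(s, c) for s in skyline):
                    heapq.heappush(heap, (sum(c), next(tie), child))
        else:
            skyline.append(entry)  # reported progressively, in mindist order
    return skyline


leaf1 = Node(points=[(1, 9), (2, 10), (4, 8)])
leaf2 = Node(points=[(6, 7), (9, 10), (7, 5)])
leaf3 = Node(points=[(5, 6), (4, 3)])
leaf4 = Node(points=[(9, 1), (8, 3), (10, 4)])
root = Node(children=[Node(children=[leaf1, leaf2]), Node(children=[leaf3, leaf4])])

print(bbs_skyline(root))  # [(4, 3), (9, 1), (1, 9)]
```

In this run the leaf holding (6, 7), (9, 10) and (7, 5) is never expanded, because its lower-left corner is already dominated by (4, 3) when it is popped.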

Analysis

Strategy 1

Analysis of Strategy 1

Notation

Variable    Description
s           # of skyline objects
e           # of empty queries
ne          # of non-empty queries
r           # of redundant queries
d           dimensionality
h           height of the given R-Tree

Recursion Tree

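[Figure: recursion tree in which each node spawns d new recursive NN queries.]

$ne = s + r$

$e = ne \cdot (d - 1) + 1$, since $ne + e = ne \cdot d + 1$ (root)

$e = (s + r)(d - 1) + 1$

$NA_{NN} \ge (e + s + r) \cdot h = [(s + r)(d - 1) + 1 + (s + r)] \cdot h > s \cdot h \cdot d$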

Analysis

Strategy 2

Analysis of Strategy 2 (brief version)

Notation

Variable    Description
s           # of skyline objects
h           height of the given R-Tree

$s \cdot h \ge NA_{BBS}$

$NA_{NN} > s \cdot h \cdot d > NA_{BBS}$
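A small numeric check can make the two bounds concrete. The sketch below assumes the reading in which every NN query, empty or not, costs at least h node accesses, so that the lower bound for NN is (e + s + r)·h; the function names and the sample values of s, r, d and h are hypothetical.

```python
def nn_lower_bound(s, r, d, h):
    """Lower bound on node accesses for the NN-based strategy."""
    ne = s + r                # non-empty queries
    e = ne * (d - 1) + 1      # empty queries, from ne + e = ne * d + 1
    return (e + ne) * h


def bbs_upper_bound(s, h):
    """Upper bound on node accesses for BBS."""
    return s * h


s, r, d, h = 3, 0, 2, 2  # hypothetical values, for illustration only
print(nn_lower_bound(s, r, d, h))  # 14
print(s * h * d)                   # 12
print(bbs_upper_bound(s, h))       # 6
```

Even with no redundant queries (r = 0), the NN bound stays above s·h·d, which in turn is above the BBS bound whenever d > 1.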

Is it the optimal solution?

BBS Algorithm

Proof 1. Termination & Correctness

Lemma 1. BBS visits entries in ascending order of their distance to the ‘ideal point’.

Lemma 2. Any data point added to Result_Set is guaranteed to be a final skyline point.

Proof.

Suppose not: some $p_j$ was added to Result_Set but is not a final skyline point.

Then $\exists\, p^* \in DB$ s.t. $p^* \gg p_j$, which means $L_1(\text{ideal}, p^*) < L_1(\text{ideal}, p_j)$.

However, by Lemma 1, $p^*$ must have been visited before $p_j$.

So $p_j$ should have been pruned, which contradicts the assumption.

Lemma 3. Every data point will be examined, unless one of its ancestor nodes has been pruned.
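The step in the proof of Lemma 2 that goes from dominance to a smaller $L_1$ distance can be spelled out. The sketch below assumes that the ideal point is the origin, that coordinates are non-negative, and that smaller values are preferred, so that $p^* \gg p_j$ means $p^*[i] \le p_j[i]$ for every $i$ with strict inequality for at least one $i$. These conventions are assumptions made for this derivation.

```latex
\begin{align*}
L_1(\text{ideal}, p^*) &= \sum_{i=1}^{d} p^*[i] \\
                       &< \sum_{i=1}^{d} p_j[i] = L_1(\text{ideal}, p_j)
\end{align*}
```

The inequality is strict because every term on the left is at most the matching term on the right, and at least one is strictly smaller.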

Lemmas for the theorem

Lemma 4. Any skyline algorithm based on R-Tree must access all the nodes whose MBRs intersect the SSR (skyline search region).

Lemma 5. If an entry e doesn’t intersect the SSR, then $\exists\, p^*$ s.t. $L_1(\text{ideal}, p^*) < L_1(\text{ideal}, e.\text{leftdown})$.

Theorem: The # of node accesses performed by BBS is OPTIMAL.

[Figure: points A to G on the x and y axes, with Dominating Area(B) and the SSR marked.]

Proof of the theorem

Proof 1. BBS only accesses nodes that may contain skyline points. That is, BBS only accesses nodes whose MBRs intersect the SSR.

Suppose not: BBS accesses a node e that doesn’t intersect the SSR.

Then $\exists\, p^*$ by Lemma 5.

This contradicts Lemma 1, since $p^*$ would be visited before e.

Proof 2. BBS visits each node at most once (trivial).


To quantify the actual cost

Skip the details

Experimental Evaluation

Dimensionality

Cardinality (3d dataset)

Progressive behavior (N=1M, d=3)

Constrained skyline queries (N=1M, d=3)

[Figure (example): points a to n on the x and y axes, with IDEAL marked; MBRs L1E1, L1E2 and L2E1 to L2E4.]

Constrain