An optimal and progressive algorithm for skyline queries slide
INI Lab.
An Optimal and Progressive Algorithm for Skyline Queries
Dimitris Papadias, Yufei Tao, Greg Fu, Bernhard Seeger
ACM SIGMOD 2003
Presenters: KYEONG SEOK HYUN, WOO-SUNG CHOI, JA-YEON KIM
Abstract
An Optimal and Progressive Algorithm for Skyline Queries Using R-Tree
Contents
1. Introduction
2. Related Work
2.1 Block Nested Loop (BNL)
2.5 Nearest Neighbor (NN)
3. Branch and Bound Skyline Algorithm
With I/O analysis
5. Experimental Evaluation
Skyline
Problem definition
Which one do you prefer?
http://www.huffingtonpost.kr/2014/11/13/story_n_6150254.html
http://drmoontv.blogspot.kr/2013/03/blog-post_17.html
http://emperia.egloos.com/m/2516211
5,000 Won
40,000 Won
4,500 Won
http://flickrhivemind.net/User/Trollface%20T-Shirts/Interesting
Good value >> poor value
Preliminaries
Formal definition of Dominates (≪)
Given a set T of d-dimensional points,
we say that a point t1 ∈ T DOMINATES another point t2 ∈ T
if and only if
$\forall i \in \{1, 2, \dots, d\}: t_1[i] \ge t_2[i]$
$\exists j \in \{1, 2, \dots, d\}: t_1[j] > t_2[j]$
This is denoted by t2 ≪ t1
(put simply, t1 is the better deal)
Definition from http://www.comp.nus.edu.sg/~atung/publication/k_dominant.pdf
Note that the meaning of ‘dominates’ may differ
according to the type of application
Which one do you prefer?
http://www.huffingtonpost.kr/2014/11/13/story_n_6150254.html
http://drmoontv.blogspot.kr/2013/03/blog-post_17.html
http://emperia.egloos.com/m/2516211
5,000 Won
40,000 Won
4,500 Won
4,500 Won
http://flickrhivemind.net/User/Trollface%20T-Shirts/Interesting
Still: good value >> poor value
Hotel(attraction, 1/price, 1/distance)
Two hotels
A : `80`, `1/15,000`, `1/500m`
B : `30`, `1/20,000`, `1/1500m`
𝐵 ≪ 𝐴
Why?
30<80
1/20,000 < 1/15,000
1/1,500m < 1/500m
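The dominance test above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `dominates` and the tuple encoding are assumptions, and it follows the slide's convention that larger values are better on every dimension.

```python
from typing import Sequence


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """Return True if t1 dominates t2 (t2 ≪ t1); larger values are better."""
    at_least_as_good = all(a >= b for a, b in zip(t1, t2))
    strictly_better = any(a > b for a, b in zip(t1, t2))
    return at_least_as_good and strictly_better


# Hotels encoded as (attraction, 1/price, 1/distance), as on the slide.
hotel_a = (80, 1 / 15_000, 1 / 500)
hotel_b = (30, 1 / 20_000, 1 / 1500)

print(dominates(hotel_a, hotel_b))  # A against B
print(dominates(hotel_b, hotel_a))  # B against A
```

Taking reciprocals of price and distance turns "cheaper" and "closer" into "larger is better", so a single comparison direction works for all three attributes.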
[Figure: hotels A and B plotted on attraction and 1/price axes; A dominates B (B ≪ A)]
for example,
Very important
Problem Definition (mathematical)
The Skyline operator
Input: a set of objects P = {p1, p2, …, pN}
Output: $\{\, p_i \mid p_i \in P \text{ and } \nexists\, p^* \in P \text{ s.t. } p_i \ll p^* \,\}$
[Figure: points A, B, C, D, E, F, G on x/y axes, with Dominating Area(B) marked]
Common misconceptions
“B ∈ Output, since B ≫ C, D, F”: wrong
“B ∈ Output, since no other point p ∈ P satisfies p ≫ B”: correct
Naïve approach
for processing skyline queries
Exhaustive Test
Suppose there are n objects in the given set
D = {o1, o2, …, on}
Algorithm - Naïve 1
S = ∅
for each object ox ∈ D
    boolean isDominated = false
    for each object oy ∈ D
        if ox ≠ oy AND ox ≪ oy then
            isDominated = true;
            break;
    if !isDominated then S = S ∪ {ox}
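The exhaustive test can be written as a short runnable sketch. The names `dominates` and `naive_skyline` and the sample points are hypothetical, chosen only for illustration; larger values are assumed to be better, as in the definition of ≪ earlier in the slides.

```python
def dominates(p, q):
    """True if p dominates q (q ≪ p): p >= q on every axis and p > q on some axis."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def naive_skyline(data):
    """Exhaustive test: keep every object that no object in the set dominates."""
    skyline = []
    for ox in data:
        # An object never dominates itself, so no explicit ox != oy check is needed.
        is_dominated = any(dominates(oy, ox) for oy in data)
        if not is_dominated:
            skyline.append(ox)
    return skyline


# Hypothetical 2-d sample (larger is better on both axes).
points = [(1, 9), (3, 7), (2, 6), (5, 5), (4, 2), (7, 3), (6, 1)]
print(naive_skyline(points))  # [(1, 9), (3, 7), (5, 5), (7, 3)]
```

Every object is compared against every other object, so the cost is quadratic in n regardless of how small the skyline is.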
Nested Loop Structure
Modification: (Algorithm -Naïve 2)
Idea 1. Use Nested Loop Structure
Idea 2. Take advantage of ‘Block-transfer’
towards better re-usability!
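One way to picture the two ideas is a block-nested-loop style scan that reads the data block by block and keeps a window of current candidates. The sketch below is an assumption-laden illustration, not the slides' exact algorithm: `bnl_skyline`, `blocks`, and `block_size` are hypothetical names, and the window is assumed to always fit in memory (so there is no overflow to a temporary file).

```python
def dominates(p, q):
    """True if p dominates q; larger values are better on every axis."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def blocks(data, block_size):
    """Yield the data in consecutive blocks, mimicking block-wise transfer."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def bnl_skyline(data, block_size=3):
    window = []  # current skyline candidates (assumed to fit in memory)
    for block in blocks(data, block_size):
        for p in block:
            if any(dominates(w, p) for w in window):
                continue  # p is dominated by a candidate: discard it
            # p survives: drop the candidates that p dominates, then keep p.
            window = [w for w in window if not dominates(p, w)]
            window.append(p)
    return window


# Hypothetical 2-d sample (larger is better on both axes).
points = [(1, 9), (3, 7), (2, 6), (5, 5), (4, 2), (7, 3), (6, 1)]
print(bnl_skyline(points))
```

Each block is loaded once and compared only against the window, which is where the re-use comes from; the scan still touches every object.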
[Figure: points A to G grouped into Block A and Block B]
Inherent limitations of these approaches
1. They need a full scan over the data
2. Yet the query result contains only a small fraction of the dataset
3. That is, these approaches are wasteful
R-Tree Index Approach
for processing skyline queries
Preliminaries
R-Tree
Nearest Neighbor Query
R-Tree: a balanced tree for indexing multi-dimensional objects
Supports dynamic operations (insert, update, delete)
R-Tree vs. B-Tree
B+-Tree: balanced (all leaves are required to be at the same depth); leaf nodes contain one-dimensional values
R-Tree: similar to B+-Tree; leaf nodes contain d-dimensional values
http://courses.cs.washington.edu/courses/cse444/09sp/hw/hw3/hw3.html
Spatial objects (or d-dimensional objects or geometric objects)
d-dimensional object? An R-Tree is used to organize
a set of d-dimensional objects
How? Main idea:
Minimum Bounding Rectangles (MBRs)
http://caversham.otago.ac.nz/research/geog.php
<Objects in 2-dimension space>
Quiz
What is the minimum number of points for representing
a rectangle?
Assumption: each rectangle is parallel to the coordinate axes
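Under the axis-parallel assumption, two opposite corners are enough to describe a rectangle, which is how an MBR is usually stored. A minimal sketch, with hypothetical names `mbr` and `contains` and made-up sample points:

```python
def mbr(points):
    """Return the (lower, upper) corners of the minimum bounding rectangle."""
    lower = tuple(min(axis) for axis in zip(*points))
    upper = tuple(max(axis) for axis in zip(*points))
    return lower, upper


def contains(rect, point):
    """True if the point lies inside the axis-parallel rectangle (borders included)."""
    lower, upper = rect
    return all(lo <= x <= hi for lo, x, hi in zip(lower, point, upper))


# Hypothetical sample points in 2-d.
pts = [(4, 7), (6, 18), (8, 9)]
rect = mbr(pts)
print(rect)                    # ((4, 7), (8, 18))
print(contains(rect, (5, 10)))
```

The same two-corner representation works unchanged in d dimensions, since `zip(*points)` iterates over one axis at a time.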
[Figure: an axis-parallel rectangle drawn on x/y axes]
Demonstration
R-Tree Simulator
Nearest Neighbor (NN) Query Processing using R-Tree
Nearest Neighbor Query
Input: a set of objects P = {p1, p2, …, pN} and a query point q
Output: $\{\, p_i \mid p_i \in P \text{ and } \nexists\, p^* \in P \text{ s.t. } L_p(p_i, q) > L_p(p^*, q) \,\}$
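Read literally, the output is every object for which no other object is strictly closer to q under the chosen L_p norm. A brute-force sketch of that definition, without any index (the names `lp_distance` and `nearest_neighbors` and the sample data are assumptions made for illustration):

```python
def lp_distance(a, b, p=2):
    """L_p distance between two points of equal dimensionality."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def nearest_neighbors(points, q, p=2):
    """All points at minimum L_p distance from q (ties are all returned)."""
    best = min(lp_distance(pt, q, p) for pt in points)
    return [pt for pt in points if lp_distance(pt, q, p) == best]


# Hypothetical data set and query point.
data = [(1, 1), (3, 4), (0, 5)]
print(nearest_neighbors(data, (0, 0), p=1))
print(nearest_neighbors([(1, 2), (2, 1)], (0, 0), p=1))
```

This scans every object; the R-Tree based processing in the slides exists precisely to avoid that full scan.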
See how it works in appendix
[Figure: root node with entries 0 and 1 on x/y axes, annotated with MINDIST(X, 0), MINDIST(X, 1), MINMAXDIST(X, 0), MINMAXDIST(X, 1)]
Key IDEA! Pruning!
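The slide gives no formulas for the two bounds, so the sketch below assumes the definitions commonly used for R-Tree nearest-neighbor search with squared Euclidean distance: MINDIST is the distance to the closest point of an MBR, and MINMAXDIST is an upper bound on the distance to the nearest object inside it. All function names and the sample rectangles are hypothetical.

```python
def mindist_sq(q, lower, upper):
    """Squared distance from q to the closest point of the MBR [lower, upper]."""
    total = 0.0
    for x, lo, hi in zip(q, lower, upper):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
    return total


def minmaxdist_sq(q, lower, upper):
    """Squared upper bound on the distance from q to the nearest object in the MBR."""
    # Per axis: coordinate of the nearer face and of the farther face.
    near = [lo if x <= (lo + hi) / 2 else hi for x, lo, hi in zip(q, lower, upper)]
    far = [lo if x >= (lo + hi) / 2 else hi for x, lo, hi in zip(q, lower, upper)]
    far_sq = [(x - f) ** 2 for x, f in zip(q, far)]
    total_far = sum(far_sq)
    # Use the near face on one axis k and the far faces on all other axes.
    return min(total_far - far_sq[k] + (q[k] - near[k]) ** 2 for k in range(len(q)))


def can_prune(q, candidate, other):
    """An MBR can be skipped if it starts farther away than another MBR's guarantee."""
    return mindist_sq(q, *candidate) > minmaxdist_sq(q, *other)


# Hypothetical query point and two MBRs given as (lower, upper) corners.
q = (0, 0)
mbr0 = ((1, 1), (3, 3))
mbr1 = ((4, 4), (6, 6))
print(mindist_sq(q, *mbr0), minmaxdist_sq(q, *mbr0))
print(can_prune(q, mbr1, mbr0))
```

Because every face of an MBR touches at least one object, some object of `mbr0` is guaranteed to lie within its MINMAXDIST, which is what justifies discarding `mbr1` without reading it.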
http://www.installitdirect.com/blog/easy-tips-for-pruning-your-plants/
http://ko.aliexpress.com/store/category/pruning-tools/519349_100005637.html
http://www.davey.com/
Back to the original question
Skyline with R-Tree
R-Tree Index Approach
Let’s process skyline objects using R-Tree
Strategy 1 – Use a traditional technique (i.e., NN Query)
Strategy 2 – This paper
Strategy 1
Partition the data using NN Query recursively
Distance metric: 𝐿1 𝑛𝑜𝑟𝑚
First NN Query -> start from the ideal point (i.e. zero point)
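A rough 2-d sketch of this recursive partitioning follows. It makes several assumptions: smaller values are better (so the ideal point is the origin), coordinates are non-negative, points are distinct, and a brute-force scan stands in for the R-Tree NN query. The names `nn_in_region` and `nn_skyline_2d` and the sample points are hypothetical.

```python
import math


def l1_to_ideal(p):
    """L1 distance to the ideal point (0, 0); coordinates are assumed >= 0."""
    return sum(p)


def nn_in_region(points, bound):
    """Brute-force stand-in for an NN query restricted to x < bx and y < by."""
    inside = [p for p in points if p[0] < bound[0] and p[1] < bound[1]]
    return min(inside, key=l1_to_ideal) if inside else None


def nn_skyline_2d(points):
    skyline = []
    todo = [(math.inf, math.inf)]  # to-do areas, each given by its upper bounds
    while todo:
        bx, by = todo.pop()
        p = nn_in_region(points, (bx, by))
        if p is None:
            continue  # empty query: nothing left in this area
        skyline.append(p)
        todo.append((p[0], by))  # to-do area with smaller x than p
        todo.append((bx, p[1]))  # to-do area with smaller y than p
    return skyline


# Hypothetical 2-d sample (smaller is better on both axes).
pts = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
       (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
print(nn_skyline_2d(pts))
```

The nearest neighbor of the ideal point inside an area cannot be dominated, because any dominating point would lie in the same area and be closer; everything it dominates falls outside both new to-do areas and is never visited again.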
Strategy 1
Recursive NN Query
[Figure, step 1: points a, b, c, d, e, f, g, i, k, m, n on x/y axes with the IDEAL point; i is found, giving Dominating Area(i), To-do Area 1, and To-do Area 2]
[Figure, step 2: points b, i, k with Dominating Area(i), To-do Area 1, and To-do Area 2]
[Figure, step 3: k is found; Dominating Area(k) is added next to Dominating Area(i); To-do Area 1 and To-do Area 2 remain]
Next, test these areas (only to find nothing)
[Figure, step 4: a is found in To-do Area 1; Dominating Area(a) is added]
Result
[Figure: skyline points i, k, a with Dominating Area(i), Dominating Area(k), Dominating Area(a)]
Limitation of Strategy 1
Generally speaking,
in a d-dimensional space,
each discovered skyline object triggers d recursive partitioning phases
[Figure: the dominated region and the resulting partitions Area 1, Area 2, Area 3]
What if?
In general, for d > 2,
the partitions overlap,
which necessitates DUPLICATE ELIMINATION
[Figure: overlapping partitions Area 1, Area 2, Area 3 and their dominated regions]
Disadvantage!
Strategy 1 needs an additional phase
for removing redundant outputs
4 elimination methods
Laisser-faire
Propagate
Merge
Fine-grained Partitioning
They work
Problem: sub-optimal
Strategy 2
Branch & Bound Skyline Algorithm
Idea!
Similar to previous NN Query
Branch & Bound Skyline (BBS)
http://greatleadersserve.org/leadership/big-idea-great-leaders-serve/
example
[Figure: data points a, b, c, d, e, f, g, h, i, k, l, m, n on x/y axes with the IDEAL point; MBRs L1E1, L1E2, L2E1, L2E2, L2E3, L2E4]
R-Tree nodes
Root: Ptr 1, Ptr 2, Ptr 3, Ptr 4
L1E1: Ptr 1, Ptr 2, Ptr 3, Ptr 4
L1E2: Ptr 1, Ptr 2, Ptr 3, Ptr 4
L2E1: a, b, c, null
L2E2: c, h, i, null
L2E3: d, g, m, null
L2E4: f, k, l, n
example
[Figure and R-Tree nodes: same as on the first example slide]
Root entries: L1E1, L1E2
Queue: (L1E2, 4), (L1E1, 10)
Result: (empty)
example
[Figure and R-Tree nodes: same as on the first example slide]
Expanded: (L1E2, 4)
Queue: (L2E2, 5), (L2E3, 7), (L2E4, 8), (L1E1, 10)
Result: (empty)
example
[Figure and R-Tree nodes: same as on the first example slide]
Queue: (L2E2, 5), (L2E3, 7), (L2E4, 8), (L1E1, 10)
Entries of L2E2: (c, 12), (h, 7), (i, 5)
Result: (empty)
example
[Figure and R-Tree nodes: same as on the first example slide]
Queue: (i, 5), (h, 7), (L2E3, 7), (L2E4, 8), (L1E1, 10), (c, 12)
Result: (empty)
example
[Figure and R-Tree nodes: same as on the first example slide]
Queue: (h, 7), (L2E3, 7), (L2E4, 8), (L1E1, 10), (c, 12)
Result: (i, 5)
example
[Figure and R-Tree nodes: same as on the first example slide]
Queue: (L2E4, 8), (L1E1, 10), (c, 12)
Result: (i, 5)
k, 10  f  n  i
example
[Figure and R-Tree nodes: same as on the first example slide]
Queue: (empty)
Result: (i, 5), (a, 10), (k, 10)
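The queue-driven walkthrough above can be sketched as follows. This is a simplified illustration rather than the paper's implementation: the `Node` class is a hand-built toy tree (no real R-Tree insertion or splitting), smaller values are assumed to be better so the ideal point is the origin, the priority key is the L1 distance of an entry's lower-left corner to the origin, and the coordinates are made up because the slide's coordinates are not recoverable.

```python
import heapq
from itertools import count


def dominates(p, q):
    """True if p dominates q when smaller is better on every axis."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


class Node:
    """Toy R-Tree node: children are either Nodes or data points (tuples)."""

    def __init__(self, children):
        self.children = children
        corners = [c.lower if isinstance(c, Node) else c for c in children]
        # Lower-left corner of the MBR: per-axis minimum over the children.
        self.lower = tuple(min(axis) for axis in zip(*corners))


def lower_corner(entry):
    return entry.lower if isinstance(entry, Node) else entry


def bbs(root):
    skyline = []
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(sum(root.lower), next(tie), root)]
    while heap:
        _, _, entry = heapq.heappop(heap)
        if any(dominates(s, lower_corner(entry)) for s in skyline):
            continue  # pruned: the whole entry is dominated
        if isinstance(entry, Node):
            for child in entry.children:
                c = lower_corner(child)
                if not any(dominates(s, c) for s in skyline):
                    heapq.heappush(heap, (sum(c), next(tie), child))
        else:
            skyline.append(entry)  # a non-dominated point is a final skyline point
    return skyline


# Hypothetical data, grouped by hand into four leaves and two inner nodes.
leaves = [Node([(1, 9), (2, 10), (4, 8)]), Node([(6, 7), (9, 10), (7, 5)]),
          Node([(4, 3), (3, 2), (5, 6)]), Node([(9, 1), (10, 4), (6, 2), (8, 3)])]
root = Node([Node(leaves[:2]), Node(leaves[2:])])
print(bbs(root))
```

Points leave the heap in ascending distance from the ideal point, so anything that could dominate a point has already been reported when that point is popped; this is why results can be emitted progressively.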
Analysis
Strategy 1
Analysis of Strategy 1
Notation
Variable    Description
s           Number of skyline objects
e           Number of empty queries
ne          Number of non-empty queries
r           Number of redundant queries
d           Dimensionality
h           Height of the given R-Tree
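Recursion Tree
[Figure: recursion tree in which each non-empty query spawns d new recursive NN queries]
$n_e = s + r$
$e = n_e \cdot (d - 1) + 1$, since $n_e + e = n_e \cdot d + 1$ (the $+1$ is the root)
$e = (s + r)(d - 1) + 1$
$NA_{NN} \ge (e + s + r) \cdot h = [(s + r)(d - 1) + 1 + s + r] \cdot h > s \cdot h \cdot d$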
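The last inequality skips a simplification step. Written out with the same symbols, and assuming only that $r \ge 0$ and $h \ge 1$, it reads:

```latex
\begin{align*}
(e + s + r)\cdot h
  &= \bigl[(s + r)(d - 1) + 1 + (s + r)\bigr]\cdot h \\
  &= \bigl[(s + r)\,d + 1\bigr]\cdot h \\
  &= (s + r)\,h\,d + h \\
  &> s\cdot h\cdot d .
\end{align*}
```

As a purely illustrative check with made-up values $s = 10$, $r = 0$, $d = 3$, $h = 4$: $e = 10 \cdot 2 + 1 = 21$, so the bound is $(21 + 10) \cdot 4 = 124$, which exceeds $s \cdot h \cdot d = 120$.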
Analysis
Strategy 2
Analysis of Strategy 2 (brief version)
Notation
Variable    Description
s           Number of skyline objects
h           Height of the given R-Tree
$s \cdot h \ge NA_{BBS}$
$NA_{NN} > s \cdot h \cdot d > NA_{BBS}$
Is it the optimal solution?
BBS Algorithm
Proof 1.
Termination & Correctness
Lemma 1. BBS visits entries in ascending order
of their distance to the ‘ideal point’
Lemma 2. Any data point added to Result_Set
is guaranteed to be a final skyline point
Proof.
Suppose not: some pj was added to Result_Set but is not a final skyline point.
Then ∃ p* ∈ DB s.t. p* ≫ pj, which means $L_1(ideal, p^*) < L_1(ideal, p_j)$.
By Lemma 1, p* must therefore have been visited before pj.
So pj should have been pruned, which contradicts the assumption.
Lemma 3. Every data point will be examined unless one of its ancestor
nodes has been pruned.
Lemmas for the theorem
Lemma 4. Any skyline algorithm
based on R-Tree must access all the
nodes whose MBRs intersect the SSR
Lemma 5. If an entry e doesn’t
intersect the SSR,
then ∃ p* s.t. $L_1(ideal, p^*) < L_1(ideal, e.leftdown)$
Theorem: The # of node accesses
performed by BBS is OPTIMAL
[Figure: points A to G on x/y axes with Dominating Area(B) and the skyline search region (SSR)]
Proof of the theorem
Proof 1. BBS only accesses nodes that
may contain skyline points.
That is, BBS only accesses nodes
whose MBRs intersect the SSR.
Suppose not: BBS accesses a node e that doesn’t intersect the SSR.
Then ∃ p* by Lemma 5,
which contradicts Lemma 1.
Proof 2. BBS visits each node at most
once. (trivial)
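The "intersects the SSR" condition used in the proof can be checked with a single corner test, assuming the SSR is the part of the space not dominated by any skyline point and that smaller values are better. The function name `intersects_ssr` and the sample values are hypothetical.

```python
def dominates(p, q):
    """True if p dominates q when smaller is better on every axis."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def intersects_ssr(mbr_lower, skyline):
    """An MBR reaches into the SSR iff no skyline point dominates its lower-left corner."""
    return not any(dominates(s, mbr_lower) for s in skyline)


# Hypothetical skyline (smaller is better) and two MBR lower-left corners.
skyline = [(3, 2), (1, 9), (9, 1)]
print(intersects_ssr((6, 5), skyline))  # corner dominated by (3, 2)
print(intersects_ssr((1, 8), skyline))  # corner not dominated by any skyline point
```

If a skyline point dominates the lower-left corner, it also dominates every point inside the MBR, so such a node cannot contribute to the skyline and never needs to be read.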
To quantify the actual cost
Skip the details
Experimental Evaluation
[Plot: Dimensionality]
[Plot: Cardinality (3d dataset)]
[Plot: Progressive behavior (N=1M, d=3)]
[Plot: Constrained skyline queries (N=1M, d=3)]
Constrain