Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu. Georgia Tech. VLDB 2009.
1
Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu
Randomized Multi-pass Streaming Skyline Algorithm
Georgia Tech
VLDB 2009
2
In one sentence ….
“We develop a streaming algorithm for skyline problem with near-optimal worst-case guarantee.”
6
What is skyline?
7
Hotel                  Price   Distance
Athena                 $97     2.9 km
Park & Suites          $124    3.6 km
Hotel du Helder        $76     3.8 km
de la Cité Concorde    $220    0.67 km
Mercure Carlton Lyon   $163    3.0 km
I want a cheap hotel nearby
8
Hotel                  Price   Distance
Athena                 $97     2.9 km
Park & Suites          $124    3.6 km
Hotel du Helder        $76     3.8 km
de la Cité Concorde    $220    0.67 km
Mercure Carlton Lyon   $163    3.0 km
I want a cheap hotel nearby
[Annotation on the table: “dominates”]
9
Hotel                  Price   Distance
Athena                 $97     2.9 km
Park & Suites          $124    3.6 km
Hotel du Helder        $76     3.8 km
de la Cité Concorde    $220    0.67 km
Mercure Carlton Lyon   $163    2.9 km
I want a cheap hotel nearby
[Annotation on the table: “dominates”]
10
[Figure: hotels plotted by Price and Distance; points labeled de la Cité, Park & Suites, du Helder, Athena, Mercure]
12
Problem definition
• Given distinct d-dimensional points
• (a1, …, ad) dominates (b1, …, bd) if ai ≤ bi for all i and ai’ < bi’ for some i’
• Skyline = set of undominated points
Example: (1, 3), (5, 2), (3, 2)
(3, 2) dominates (5, 2)
Skyline = { (1, 3), (3, 2) }
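The definition above can be checked directly on the slide's example. The following is a minimal sketch, not the algorithm from the talk: the names dominates and naive_skyline are hypothetical, and the quadratic scan is only a reference implementation that holds every point in memory.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def naive_skyline(points: List[Point]) -> List[Point]:
    """Reference implementation: keep each point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


example = [(1, 3), (5, 2), (3, 2)]
skyline = naive_skyline(example)  # (5, 2) is dropped because (3, 2) dominates it
```

On the example, skyline holds (1, 3) and (3, 2), matching the slide.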
13
Skyline algorithms
RAM:
  DD&C (Kung et al., FOCS’75), LD&C (Bentley et al., JACM’78), FLET (Bentley et al., SODA’90)
Disk (External), preprocessing:
  BBS (Papadias et al., SIGMOD’03), NN (Kossmann et al., VLDB’02)
Disk (External), non-preprocessing:
  SD&C (Börzsönyi et al., ICDE’01), BNL (Börzsönyi et al., ICDE’01), SFS (Chomicki et al., ICDE’03), LESS (Godfrey et al., VLDB’05)
14
Our Goal
“Non-preprocessing external algorithm with worst-case guarantee”
What is the model of external algorithms?
15
Models for external algorithms
• CPU processing ≠ I/O
• Sequential I/O ≠ Random I/O
Multi-pass Streaming Model
• # of random I/O’s = # of passes
• Streaming model naturally forces us to minimize the number of random I/O’s
16
What is multi-pass stream?
17
(1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)
Small RAM
Huge Harddisk
21
(1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)
Small RAM
Huge Harddisk
2nd pass
22
(1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)
Small RAM
Huge Harddisk
3rd pass
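The multi-pass model shown on these slides can be mimicked in a few lines. This is only an illustration under assumptions: MultiPassStream is a hypothetical wrapper, a Python list stands in for the hard disk, and the two-pass task (find the point with the smallest first coordinate, ties broken by the second) is chosen just to show constant-size state surviving between passes.

```python
from typing import Iterator, List, Tuple

Point = Tuple[int, int]


class MultiPassStream:
    """Hypothetical model: data is read front to back only, and each scan counts as one pass."""

    def __init__(self, points: List[Point]) -> None:
        self._points = points  # stands in for the huge hard disk
        self.passes = 0

    def scan(self) -> Iterator[Point]:
        self.passes += 1  # one pass = one seek to the start of the data
        for p in self._points:
            yield p


stream = MultiPassStream([(1, 2), (3, 7), (5, 3), (2, 5), (4, 1), (9, 9)])

# Pass 1: the "small RAM" keeps a single number, the smallest first coordinate.
min_x = min(p[0] for p in stream.scan())

# Pass 2: among points with that first coordinate, keep the smallest second coordinate.
min_y = min(p[1] for p in stream.scan() if p[0] == min_x)
```

After the two scans, stream.passes is 2, which is the quantity the model equates with the number of random I/O's.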
23
Our Goal
“Non-preprocessing streaming algorithm with worst-case guarantee”
(“external” is replaced by “streaming”)
24
Main results
Theory
RAND: Almost optimal multi-pass streaming algorithm for skyline
• RAND uses O(log n) passes & O(m) space
• Every algorithm that uses 1 pass needs Ω(n) space
n = # of points and m = skyline size
Next: RAND algorithm
Later: Experimental result
25
RAND algorithm
26
Algorithms: Main Idea
Suppose m is known.
Theorem: In 3 passes and m space, we can find skyline points that “dominate” at least n/2 points, with high probability
Eliminate-Points algorithm
1. Sample x = 2m ln(mn log n) points p1, p2, …, px
2. Go through the stream; replace each pi by a point dominating it
3. For each pi, delete pi and all points it dominates
Output p1, p2, …, px and repeat
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
Sampled point: (4, 4)
27
40
Eliminate-Points algorithm
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
Sampled point: (4, 4), replaced by (3, 4), then by (3, 3)
1. Sample x = 2m ln(mn log n) points p1, p2, …, px
2. Go through the stream; replace each pi by a point dominating it
3. For each pi, delete pi and all points it dominates
Output p1, p2, …, px and repeat
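The three steps can be sketched on an in-memory list standing in for the stream. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, "log n" is read as base 2, the sample size is clamped to at least 1, and sampling is done with replacement.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sample_size(m: int, n: int) -> int:
    """x = 2m ln(mn log n) from the slide; base-2 log and the clamp are assumptions."""
    inner = m * n * max(math.log2(n), 1.0)
    return max(1, math.ceil(2 * m * math.log(inner)))


def eliminate_points(stream: Sequence[Point], sample: List[Point]):
    """Steps 2 and 3 for a given sample; returns (output points, surviving points)."""
    current = list(sample)
    # Step 2 (one pass): replace each p_i by a stream point that dominates it.
    for q in stream:
        current = [q if dominates(q, p) else p for p in current]
    found = list(dict.fromkeys(current))  # drop repeated p_i, keep order
    # Step 3 (one pass): delete each p_i and every point it dominates.
    survivors = [q for q in stream
                 if q not in found and not any(dominates(p, q) for p in found)]
    return found, survivors


def eliminate_round(stream: Sequence[Point], m: int, rng: random.Random):
    """Step 1 (sampling, with replacement by assumption) followed by steps 2 and 3."""
    sample = rng.choices(list(stream), k=sample_size(m, len(stream)))
    return eliminate_points(stream, sample)


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
found, survivors = eliminate_points(stream, [(4, 4)])
```

With the slide's sample (4, 4), the sampled point becomes (3, 4) and then (3, 3); step 3 then removes everything except (1, 5), which is left for the next round.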
41
Analysis
Theorem: Eliminate-Points algorithm deletes at least n/2 points with high probability
42
Analysis
• Draw trees: Each point points to its first dominating point
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
[Figure: forest over the stream points; node labels (1, 5), (3, 3), (3, 4), (4, 3), (4, 4), (4, 5)]
Note: There will be m trees, each rooted by a skyline point
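The forest used in the analysis can be built explicitly for the example stream. The sketch below reads "first dominating point" as the earliest point in stream order that dominates the given point; that reading is an assumption, since the slide's figure did not survive extraction, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def first_dominator_forest(stream: List[Point]) -> Dict[Point, Optional[Point]]:
    """Map each point to the earliest stream point dominating it (None for roots)."""
    return {p: next((q for q in stream if dominates(q, p)), None) for p in stream}


def root_of(p: Point, parent: Dict[Point, Optional[Point]]) -> Point:
    """Follow parent pointers up to the root of p's tree."""
    while parent[p] is not None:
        p = parent[p]
    return p


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
parent = first_dominator_forest(stream)
roots = [p for p in stream if parent[p] is None]  # undominated points
```

Under this reading the roots are exactly the undominated points (1, 5) and (3, 3), and the sampled point (4, 4) climbs through (3, 4) to the root (3, 3), the same path the Eliminate-Points walk-through takes.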
45
Analysis
• Draw trees: Each point points to its first dominating point
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
[Figure: the same forest, annotated with the point (4, 4) on slide 45 and with (3, 3) on slide 46]
47
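Analysis
• Claim: Any tree from which some element is sampled will be deleted
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
[Figure: the forest with node labels (1, 5), (3, 3), (3, 4), (4, 3), (4, 4), (4, 5); annotated with the points (4, 4) and (3, 3)]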
48
Analysis
• There are m trees, each rooted by a skyline point
[Figure: m trees, labeled 1, 2, …, m-1, m]
50
Analysis
• A big tree has a bigger chance of being sampled… and deleted
[Figure: m trees, labeled 1, 2, …, m-1, m]
51
Analysis
• If enough points are sampled, every tree that is “big enough” will be deleted
[Figure: m trees, labeled 1, 2, …, m-1, m]
52
Analysis
Lemma: With high probability, all trees of size at least n/(2m) are deleted
• We delete at least n/2 points in total
[Figure: m trees, labeled 1, 2, …, m-1, m]
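One way to see why x = 2m ln(mn log n) samples are enough is the short calculation below. It is a sketch under the assumption that the x points are drawn independently and uniformly at random, with replacement, from the n points; it is not necessarily the authors' own proof.

```latex
\begin{align*}
\Pr\bigl[\text{a fixed tree } T \text{ with } |T| \ge \tfrac{n}{2m} \text{ receives no sample}\bigr]
  &\le \Bigl(1 - \tfrac{1}{2m}\Bigr)^{x}
   \le e^{-x/(2m)}
   = \frac{1}{mn\log n}, \\
\Pr\bigl[\text{some such tree receives no sample}\bigr]
  &\le m \cdot \frac{1}{mn\log n}
   = \frac{1}{n\log n}, \\
\#\bigl\{\text{points in trees of size} < \tfrac{n}{2m}\bigr\}
  &< m \cdot \frac{n}{2m}
   = \frac{n}{2}.
\end{align*}
```

The first two lines use a union bound over at most m trees. The last line says that fewer than n/2 points sit in small trees, so at least n/2 points sit in big trees, all of which are deleted with high probability.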
53
Extending to RAND
• Recall: If we know m then we can delete n/2 points in 3 passes
• If m is known, we can find skyline in O(log n) passes with high probability
  – We delete n/2 points every 3 passes
• m is not known
  – Guess m by “doubling trick”
  – Additional O(log m) passes
• Fixed-window case
  – Memory space is limited
• Random I/O’s, Sequential I/O’s and Number of comparisons have to be analyzed separately
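The doubling trick can be sketched as a loop around Eliminate-Points. The slides do not spell out the rule for revising the guess, so the rule used here (double the guess whenever a round fails to delete half of the remaining points) is a hypothetical reading, as are the function names, the base-2 log, and sampling with replacement. The fixed-window case is not modeled.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sample_size(m: int, n: int) -> int:
    """x = 2m ln(mn log n) from the slides; base-2 log and the clamp are assumptions."""
    inner = m * n * max(math.log2(n), 1.0)
    return max(1, math.ceil(2 * m * math.log(inner)))


def eliminate_points(stream: Sequence[Point], sample: List[Point]):
    """One Eliminate-Points round for a given sample."""
    current = list(sample)
    for q in stream:  # replace each sampled point by a dominating point
        current = [q if dominates(q, p) else p for p in current]
    found = list(dict.fromkeys(current))
    survivors = [q for q in stream
                 if q not in found and not any(dominates(p, q) for p in found)]
    return found, survivors


def rand_skyline(points: Sequence[Point], seed: int = 0):
    """Repeat Eliminate-Points until no points remain; returns (skyline, rounds)."""
    rng = random.Random(seed)
    remaining = list(points)
    skyline: List[Point] = []
    guess, rounds = 1, 0  # guess is the current estimate of m
    while remaining:
        n = len(remaining)
        sample = rng.choices(remaining, k=sample_size(guess, n))
        found, survivors = eliminate_points(remaining, sample)
        skyline.extend(found)
        if len(survivors) > n / 2:  # hypothetical rule: too little progress, so double
            guess *= 2
        remaining = survivors
        rounds += 1
    return skyline, rounds


result, rounds = rand_skyline([(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)])
```

Whatever the random choices, this loop returns the exact skyline, because each round outputs only undominated points and deletes only those points or points they dominate; randomness affects just the number of rounds.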
55
Main results
Theory
RAND: Almost optimal multi-pass streaming algorithm for skyline
• O(log n) passes & O(m) space
• 1 pass needs Ω(n) space
n = # of points and m = skyline size
Algorithms comparison (w = window (memory) size)
Algorithm   Random I/O’s        Sequential I/O’s    Comparisons
BNL(w)      Θ(min{w, n/w})      Θ(min{w, n²/w})     Θ(d min{wmn, n²})
LESS(w)     Θ(n log_w (n/w))    Θ(mn/w)             Θ(dmn + n log n)
RAND(w)     O(m log (n/w))      O(mn/w)             O(dmn)
56
Main results
Experiment
RAND vs BNL & LESS
• Average case
• Worst case
We try several datasets in the literature …
Correlated, Anti-correlated, Independent, Island, House, NBA, Color
57
Experimental Results
Average case
- No clear winner between BNL and LESS
- RAND is always close to the winner
[Figure: RAND vs BNL & LESS]
58
Experimental Results
“Worse”: After sorting by decreasing first coordinate
- RAND is the most robust and usually fastest
[Figure: RAND vs BNL & LESS]
59
Experimental Results
“Even Worse”: After sorting by “entropy”
[Figure: RAND vs BNL & LESS]
Summary
60
[Figure: summary diagram. Disk and Stream: (1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9); Random Sampling over trees 1, 2, …, m-1, m; RAND; Experiment: RAND vs BNL & LESS, average case and worst case]
61
Extensions
• Distributed skyline algorithm
• Derandomize the algorithm for 2D case
• Skyline for partially ordered sets (posets)
Open problems
• Develop algorithm on Parallel Disk Model (PDM) and Cache Oblivious model
• Extend the techniques to pre-processing algorithm
• Is O(log n) passes the best possible?
Summary
62
Thank you
63
Appendix
64
Charts for average case
65
66
The lower bound
Theorem: Any randomized one-pass algorithm with space at most n/2 succeeds with probability at most 1/2
Proof
- Random unique survivor
- 2 points come at the end
- If space ≤ n/2, the algorithm fails whenever it did not store the survivor in memory
67
Proof of Claim
68
Proof of Claim
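• Claim: Any tree from which some element is sampled will be deleted
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
[Figure: the forest with node labels (1, 5), (3, 3), (3, 4), (4, 3), (4, 4), (4, 5); annotated with the points (4, 4) and (3, 3)]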
69
Analysis
• Draw trees: Each point points to its first dominating point
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
[Figure, animated over slides 69 to 74: the forest with node labels (1, 5), (3, 3), (3, 4), (4, 3), (4, 4), (4, 5); the tracked point moves from (4, 4) to (3, 4) and then to (3, 3)]