Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June...
-
Upload
bret-burger -
Category
Documents
-
view
212 -
download
0
Transcript of Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June...
![Page 1: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/1.jpg)
Partition Based Spatial – Merge Join
Present by:Tony Tong (09049620)
Cleo Tsang (09049630)June Yau (09030360)
Chelsie Chan (10104740)Oengus Lam (10104790)
1
![Page 2: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/2.jpg)
Agenda1. Problem2. Definition of Spatial Join3. PBSM Algorithm
3.1 Filter Step3.2 Refinement Step3.3 Number of Partition3.4 Spatial Partitioning Function
4. Performance4.1 Indexed Nested Loops Join4.2 R-tree Based Join Algorithm4.3 Methodology4.4 None of Indices Pre-exist4.5 In the Presence of Pre-existing Index4.6 CPU Costs
5. Conclusion
2
![Page 3: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/3.jpg)
1. Problem
• In a spatial database system, like GIS, join queries objects involve large amount of memory
• Have no pre computing data for datasets• Usually no index for intermediate result • Solve this join problem efficiently
3
![Page 4: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/4.jpg)
2. Definition of Spatial Join
• An operation of combining 2 or more datasets based on their spatial relationship
Q: “Find all pairs of rivers and cities that intersect”
Join Result Pairs:<r1,c1><r1,c2><r2,c5>r2
r1
c1
c2
c3
c4
c5
4
![Page 5: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/5.jpg)
3. PBSM Algorithm
• Partition Based Spatial-Merge Join (PBSM)• PBSM operates in 2 steps
5
Filter Step
Refinement Step
ID River_Name Length
r1 Margaret River 60km
r2 Brisbane River 344km
ID City_Name County
c1 Perth WA
c2 Brisbane QLD
Unique Identifer(OID)
Input R
Input C
![Page 6: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/6.jpg)
3.1 Filter Step• Purpose:
– To find all objects whose MBR intersects the query rectangle
• For each input (R and C), – Creation of Minimum Bounding
Rectangle (MBR)
– Rough Estimation for Search Region
– Key-Pointer Element in New Input (Rkp and Ckp)(OID + MBR)
6
Input Rkp
<r1, mr1><r2, mr2>
r2r1
c1
c2
c3
c4
c5
Input Ckp
<c1, mc1><c2, mc2>
...
Key-Pointer Element
![Page 7: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/7.jpg)
3.1 Filter Step• Spatial Join (1st Scenario)
– Rkp and Ckp fit into main memory
• Plane-Sweeping Technique– Sort by MBR.xl for Each Input
(Rkp and Ckp)
– Select the MBR in Either Input (e.g. Rkp) with Smallest MBR.xl
– Scan along the x-axis from MBR.xl to MBR.xu to check if MBRr ∩ MBRc
7
r2r1
c1
c2
c3
c4
c5
MBR.xlMBR.xu
Start with the first entry r1, sweep a vertical lineCheck if MBRr1 ∩ MBRc2,add (OIDr1, OIDc2) to result setCheck if MBRr1 ∩ MBRc1,add (OIDr1, OIDc1) to result setScan until MBR.xu,start the next entry
![Page 8: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/8.jpg)
3.1 Filter Step• Spatial Join (2nd Scenario)
– Rkp and Ckp do not fit into main memory
• Spatial Partitioning Technique– Size of Each Partition for both Input
(Rkp and Ckp) can fit into memory simultaneously
– Perform Plane-Sweeping Technique for Preliminary Spatial Join in Each Partition
• Result Pair <OIDR, OIDC>
8
r2r1
c1
c2
c3
c4
c5
Partition 2 Partition 3
Partition 0 Partition 1
![Page 9: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/9.jpg)
3.2 Refinement Step• Purpose
– #1: To eliminate duplicates induced by Partitioning
9
r2c4
c5
Partition 2 Partition 3
Partition 0 Partition 1
Result Pairs:Partition 1: <r2,c5>Partition 3: <r2,c5>, <r2,c4>
– #2: To examine the actual R and S tuples & see if the attributes satisfy join condition
![Page 10: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/10.jpg)
3.2 Refinement Step• Procedure
– #1: Sort OID pairs:• Primary Sort Key: OIDR
• Secondary Sort Key: OIDC
– #2: Read R tuples first, then C
![Page 11: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/11.jpg)
3.3 Number of Partitions• Number of Partition P is computed as:-
whereP: Number of partition
R : Cardinality of RC : Cardinality of C
Sizekey-ptr: Size of a key-pointer element (in bytes)M: Size of main memory (in bytes)
11
M
SizeCRP ptrkey*)(
![Page 12: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/12.jpg)
3.4 Spatial Partitioning Function • Non-Uniform Distributed & Clustered
Spatial Features
• By “Regular” Partitioning Method Large differences in size of partitions
12
Partition 0 Partition 1
Partition 2 Partition 3Universe
![Page 13: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/13.jpg)
3.4 Spatial Partitioning Function Step 1: Regular decomposition of universe into
NT tiles, where NT > = PStep 2: Apply Tile-to-Partition Mapping Scheme
Round Robin OR Hashing
13
“Regular” Partitioning Method “Tile-based” Partitioning Method + Round Robin Mapping Scheme
Partition 0 Partition 1
Partition 2 Partition 3
Tile0/Part0 Tile1/Part1 Tile2/Part2 Tile3/Part0
Tile4/Part1 Tile5/Part2
Tile6/Part0 Tile7/Part1
Tile8/Part2 Tile9/Part0 Tile10/Part1 Tile11/Part2
![Page 14: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/14.jpg)
3.4 Spatial Partitioning Function • What is the PERFECT Spatial Partitioning
Function ?
• Considerations:– Number of Tiles– Tile-to-Partition Mapping Scheme
(Round Robin OR Hashing)
• Data set used for investigation:– Tiger Road Data (62.4MB, 456,613 tuples)– Sequoia Polygon Data (21.9MB, 58,115 tuples)
14
It assigns equal number of tuples to each partition
![Page 15: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/15.jpg)
3.4 Spatial Partitioning Function • Observation:
Partitioning Function improves as No. of Tiles increases More uniform distribution
15
The PERFECT Partitioning Function has a coefficient of variation = 0
Spatial Partitioning FunctionAlternatives: Tiger Road Data
![Page 16: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/16.jpg)
3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,
↑ Replication Overhead
16
Replication Overhead:Tiger Road Data (16 Partitions)
Replication Overhead: Sequoia Polygon Data(16 Partitions)
Number of Tiles = An integral multiple of Number of Partitions
![Page 17: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/17.jpg)
Tile0/Part0 Tile1/Part1 Tile2/Part2
Tile3/Part0 Tile4/Part1 Tile5/Part2
Tile6/Part0 Tile7/Part1 Tile8/Part2
r1
c1
c2
Scenario:• No. of Tiles = 9• P = 3• Tile-to-Partition Mapping Scheme = Round Robin
The entire column is being mapped to a single partition ↓ ∴ Replications by partitioning, ↓ Replication overheads
![Page 18: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/18.jpg)
3.4 Spatial Partitioning Function • Observation: ↑ No. of Tiles,
↑ Replication Overhead
18
Replication Overhead:Tiger Road Data (16 Partitions)
Replication Overhead: Sequoia Polygon Data(16 Partitions)
Number of Tiles = An integral multiple of Number of Partitions
![Page 19: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/19.jpg)
4. Performance
V.SV.SIndexed
Nested Loops Join
Indexed Nested Loops
Join
PBSM Join(1024 tiles)PBSM Join(1024 tiles) V.SV.S
R-tree Based Join
R-tree Based Join
19
![Page 20: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/20.jpg)
4.1 Index Nested Loops Join
20
Build an index in R (the smaller input)• Reads the extent R• Extracts the key-pointer (<MBR,OID>)
Sort the key-pointer by MBR
Build R-tree for the key-pointer
Scan on C• For each C, fetch each R
![Page 21: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/21.jpg)
4.2 R-tree Based Join Algorithm
21
Build an R-tree index in both R and C
Find MBR with union set which is not null• Visit the roots• Move down the levels until leaf nodes
Find ID pairs with dataset union which is not null
![Page 22: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/22.jpg)
4.3 Methodology
• Database System: Paradise• Machine: Sun SPARC-10/51
– 64 MB of memory– SunOS Release 4.1.3– One Seagate 2GB disk
22
![Page 23: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/23.jpg)
TIGER file
• Road, Hydrography and Rail data of the United States etc…
• 2 join queries– Road with Hydrography– Between the Road and the Rail data
23
Data Type # of Objects Total Size R-tree Size
Road 656,613 62.4 MB 24.0 MB
Hydrography 122,149 25.2 MB 6.5 MB
Rail 16,844 2.4 MB 1.0 MB
![Page 24: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/24.jpg)
Sequoia 2000 Storage Benchmark
• Polygon– Regions of homogeneous landuse characteristics
in California• Islands
– Holes in the polygon data
24
Data Type # of Objects Total Size R-tree Size
Polygons 58,115 21.9 MB 3.0 MB
Islands 21,007 6.2 MB 1.1 MB
![Page 25: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/25.jpg)
4.4 NONE OF INDICES PRE-EXIST
25
![Page 26: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/26.jpg)
TIGER: Join Road with Hydrograhy
26
PBSM is 48-98% faster
than the R-Tree Based; 93-
300% faster than the Idx.
Nested Loops.
![Page 27: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/27.jpg)
TIGER: Join Road with Rail
27
Rail data: 2.4MB (Index: 1.0MB), fits in buffer pool; Idx. Nested
Loops performs
better than R-Tree Based.
![Page 28: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/28.jpg)
Cluster
• Data continuously , i.e. not randomly distributed
• Data are mostly in sequential order in real life
• Less computationally expensive
28
![Page 29: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/29.jpg)
Clustered TIGER: Join Road with Hydrography
29
PBSM is 40% faster than the R-Tree Based; and 60-80%
faster than the Idx. Nested
Loops.
![Page 30: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/30.jpg)
Costs• Index Building Cost
– Cost of extracting the key-pointers from the input– Sorting the key-pointers– Building the index using the sorted key pointers– If Input is clustered
No sorting key-pointers Cost of building index
• Tree Joining Cost• Refinement Step Cost
30
![Page 31: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/31.jpg)
Sequoia Data
31
PBSM is 13-27% faster
than the R-Tree Based; and 17-
114% faster than the Idx.
Nested Loops.
![Page 32: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/32.jpg)
Summary
• PBSM is better than R-tree and the Indexed Nested Loops based algorithm
• When sizes of 2 inputs differ significantly,Indexed Nested Loops is better than the R-tree based algorithm
• All algorithms improve if join inputs are clustered
32
![Page 33: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/33.jpg)
4.5 IN THE PRESENCE OF PRE-EXISTING INDEX
33
![Page 34: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/34.jpg)
When indices pre-exist on
both the inputs, the R-
tree based algorithm has
the best performance
When indices pre-exist on
both the inputs, the R-
tree based algorithm has
the best performance
TIGER: Join Road with Hydrograhy
34
![Page 35: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/35.jpg)
When an index exists only on the smaller input, PBSM
performs best.
When an index exists only on the smaller input, PBSM
performs best.
TIGER: Join Road with Rail
35
![Page 36: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/36.jpg)
4.6 CPU Cost
• Insert a table• CPU cost > I/O cost• System
– CPU intensive– Much less I/O is needed
36
![Page 37: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/37.jpg)
5. Principal Behind
• Divide and Conquer• Optimization on memory size
![Page 38: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/38.jpg)
6. Playback of this presentation
• Efficient PBSM algorithm• Comparison among different algorithm• Performance Analysis• Clustered Data• Indexed Data
38
![Page 39: Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790)](https://reader036.fdocuments.us/reader036/viewer/2022070306/55178d5355034645368b5531/html5/thumbnails/39.jpg)
Question ?
39