LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases
Xiaodan Wang, Randal Burns
Department of Computer Science
Johns Hopkins University
Tanu Malik
Cyber Center
Purdue University
LifeRaft: Data-Driven, Batch Processing
BETTER LUCK NEXT TIME!
Problem
[Figure: queries Q1, Q2, Q3, Q4]
Goals
Eliminate redundant I/O to improve query throughput
Batch queries that exhibit data sharing
– Pre-process queries to identify data sharing
– Co-schedule queries that access the same data
– Access contentious data first to maximize sharing
Starvation resistance
– Avoid indefinite queuing times (response time)
– Enforce some constraints on completion order
Target Applications
Data-intensive scan queries
– Executed against a clustered index
– Clustered and federated databases (e.g. joins that correlate multiple nodes)
Peta-scale astronomy (Pan-STARRS)
– Data are partitioned spatially
– Many queries scan the full DB and last hours or days
Cross-match
– Probabilistic spatial join across multiple databases
Filter and Refine
Filter queries
– Pre-process queries to determine join buckets
– Buckets B1,…,Bn and queries Q1,…,Qm
– Workload Wij denotes the objects from Qi that overlap Bj
Refinement
– Read buckets one at a time
– Sort-merge join (sort by HTM ID)
– Query-specific predicates applied on output tuples
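The filter and refinement steps above can be sketched roughly as follows. This is a minimal illustration, not the LifeRaft implementation: it assumes each bucket covers a contiguous range of integer HTM IDs, it replaces the probabilistic spatial match with an exact-ID match, and the names `filter_queries`, `merge_join`, and `bucket_starts` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def filter_queries(bucket_starts, queries):
    """Filter step: assign each query object to the bucket that holds its ID.

    bucket_starts: sorted list with the first HTM ID of each bucket (assumed layout).
    queries: dict of query name -> list of HTM IDs to cross-match.
    Returns workload[j][i] = sorted IDs from query i that fall in bucket j.
    """
    workload = defaultdict(lambda: defaultdict(list))
    for qname, ids in queries.items():
        for htm_id in ids:
            j = bisect_right(bucket_starts, htm_id) - 1
            if j >= 0:  # IDs below the first bucket overlap nothing
                workload[j][qname].append(htm_id)
    # Sort each per-bucket workload so refinement can merge it directly.
    for per_query in workload.values():
        for ids in per_query.values():
            ids.sort()
    return workload


def merge_join(bucket_ids, query_ids):
    """Refinement step: sort-merge join of two sorted ID lists (exact match only)."""
    out, a, b = [], 0, 0
    while a < len(bucket_ids) and b < len(query_ids):
        if bucket_ids[a] < query_ids[b]:
            a += 1
        elif bucket_ids[a] > query_ids[b]:
            b += 1
        else:
            out.append(bucket_ids[a])
            a += 1
            b += 1
    return out


# Toy data: three buckets starting at IDs 0, 100, and 200.
W = filter_queries([0, 100, 200], {"Q1": [150, 5, 120], "Q2": [130, 250]})
matches = merge_join([101, 120, 150, 199], W[1]["Q1"])
```

Because every query's objects for a bucket are collected up front, one read of that bucket can serve all of the queries that overlap it.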
Workload Throughput Metric
Schedule buckets greedily in order of decreasing workload throughput
Exploits data regions that experience contention
May starve requests
– Favors buckets experiencing frequent reuse
– No guarantee a particular bucket or query receives service
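A small sketch may help show why a purely greedy policy can starve requests. The metric's exact formula is not preserved in this transcript, so the form used here (queued objects divided by estimated I/O plus matching cost) is an assumption, as are the names `workload_throughput` and `pick_bucket` and the cost constants.

```python
def workload_throughput(n_objects, cached, t_bucket=1.0, t_match=0.001):
    """Objects joined per second if this bucket were serviced next (assumed form).

    t_bucket: seconds to read one bucket from disk (skipped when cached).
    t_match: seconds to join one queued object in memory.
    """
    if n_objects == 0:
        return 0.0
    io_cost = 0.0 if cached else t_bucket
    return n_objects / (io_cost + t_match * n_objects)


def pick_bucket(pending, cache):
    """Greedy choice: the bucket with the highest workload throughput.

    pending: dict bucket name -> number of queued query objects.
    cache: set of bucket names currently in memory.
    """
    return max(pending, key=lambda b: workload_throughput(pending[b], b in cache))


pending = {"B1": 10, "B2": 500, "B3": 40}
first = pick_bucket(pending, cache=set())       # heavily contended bucket wins
second = pick_bucket(pending, cache={"B3"})     # a cached bucket avoids the I/O cost
```

In this toy workload B1 is never chosen while busier buckets keep receiving requests, which is the starvation risk the slide describes.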
Aged Workload Throughput Metric
Inspired by disk-drive head scheduling
Balance arrival order (low response time) with contention (high throughput)
Adaptive trade-offs based on workload saturation
– Maximize the rate at which objects are joined during saturated workloads
– Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
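One way to picture the trade-off is a single score that blends contention and age through a bias parameter, where bias 0 is purely contention based and bias 1 is purely age based (the two extremes used in the scheduling examples). The linear blend and the normalization below are assumptions made for illustration, not the formula from the talk; `aged_score` and `pick_bucket_aged` are hypothetical names.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Blend normalized throughput and normalized age (assumed linear form)."""
    t = throughput / max_throughput if max_throughput else 0.0
    a = age / max_age if max_age else 0.0
    return (1.0 - bias) * t + bias * a


def pick_bucket_aged(buckets, bias):
    """buckets: dict name -> (workload throughput, age of oldest request in sec)."""
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(buckets, key=lambda b: aged_score(*buckets[b], bias, max_t, max_a))


buckets = {"hot": (300.0, 6.0), "old": (10.0, 60.0)}
by_contention = pick_bucket_aged(buckets, bias=0.0)  # favors the busy bucket
by_age = pick_bucket_aged(buckets, bias=1.0)         # favors the oldest request
```

Intermediate bias values move the choice between these two extremes, which is what makes the parameter tunable against workload saturation.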
Scheduling Behavior
Buckets: B1 B2 B3 B4 B5 B6 B7 B8
Queries: Qi, Qj, Qk
Sub-divide queries by bucket:
Qi – Qi1, Qi2, Qi3
Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
Qj – Qj5, Qj6, Qj7, Qj8
Qk
Assumptions:
- Inter-query time of 1 sec
- I/O for each bucket of 1 sec
- Cache size of 2
- Join cost is negligible
Arrival order with no sharing
[Figure: timeline over buckets B1 to B8 with arrival (Arr) and End markers for Qi, Qj, and Qk]
Bucket reads and the sub-queries each serves: Qi1 (B1), Qi2 (B2), Qi3 (B3), Qj1 (B1), Qj3 (B3), Qj4 (B4), Qj6 (B6), Qj7 (B7), Qj8 (B8), Qk1 (B1), Qk4 (B4), Qk8 (B8), …
Completion Times: Qi – 3 sec, Qj – 8 sec, Qk – 13 sec, Avg – 8 sec
Tp – .2 qry/sec
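The numbers on this slide can be checked with a short simulation. The sketch below assumes the stated 1 sec inter-query time and 1 sec per bucket read, with no reuse across queries. The bucket sets are read off the sub-queries named in the timeline entries; Qk's set is partly elided there, so the one used here is an assumption chosen to match six reads. `arrival_order_no_sharing` is a hypothetical name.

```python
def arrival_order_no_sharing(queries, inter_arrival=1.0, io_per_bucket=1.0):
    """Run queries to completion in arrival order, re-reading every bucket.

    queries: list of (name, bucket list) in arrival order.
    Returns (completion time per query in sec, throughput in queries/sec).
    """
    clock = 0.0
    completion = {}
    for n, (name, buckets) in enumerate(queries):
        arrival = n * inter_arrival
        # The query starts once it has arrived and the previous one is done.
        clock = max(clock, arrival) + io_per_bucket * len(buckets)
        completion[name] = clock - arrival
    return completion, len(queries) / clock


workload = [
    ("Qi", [1, 2, 3]),
    ("Qj", [1, 3, 4, 6, 7, 8]),
    ("Qk", [1, 4, 5, 6, 7, 8]),  # assumed: the timeline elides part of Qk
]
completion, tp = arrival_order_no_sharing(workload)
avg = sum(completion.values()) / len(completion)
```

Under these assumptions the sketch gives 3, 8, and 13 sec with an average of 8 sec and 0.2 queries per second, matching the slide.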
Age based scheduling (bias 1)
[Figure: timeline over buckets B1 to B8 with arrival (Arr) and End markers for Qi, Qj, and Qk]
Bucket reads and the sub-queries each serves: Qi1 (B1), Qi2 (B2), Qi5 (B5), Qi3 Qj3 (B3), Qj1 Qk1 (B1), Qj4 Qk4 (B4), Qj6 Qk6 (B6), Qj8 Qk8 (B8), Qj7 Qk7 (B7)
Completion Times: Qi – 3 sec, Qj – 7 sec, Qk – 7 sec, Avg – 5.6 sec
Tp – .33 qry/sec
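To see how sharing bucket reads shortens completion times, the sketch below replays a fixed read order in which each read serves every already-arrived query that still needs that bucket. The read order (B1, B2, B3, B1, B4, B6, B7, B8, B5) and the bucket sets are assumptions chosen to be consistent with the completion times on this slide; they are not taken from the scheduler itself, and caching is ignored. `shared_scan` is a hypothetical name.

```python
def shared_scan(queries, arrivals, read_order, io_per_bucket=1.0):
    """Replay bucket reads; each read serves all arrived queries that need it.

    queries: dict name -> list of buckets the query must scan.
    arrivals: dict name -> arrival time in sec.
    read_order: sequence of bucket numbers, one read per io_per_bucket sec.
    Returns (completion time per query in sec, throughput in queries/sec).
    """
    remaining = {q: set(b) for q, b in queries.items()}
    completion = {}
    clock = 0.0
    for bucket in read_order:
        start = clock
        clock += io_per_bucket
        for q, need in remaining.items():
            if q not in completion and arrivals[q] <= start and bucket in need:
                need.discard(bucket)
                if not need:  # last bucket for this query was just read
                    completion[q] = clock - arrivals[q]
    return completion, len(queries) / clock


queries = {"Qi": [1, 2, 3], "Qj": [1, 3, 4, 6, 7, 8], "Qk": [1, 4, 5, 6, 7, 8]}
arrivals = {"Qi": 0.0, "Qj": 1.0, "Qk": 2.0}
completion, tp = shared_scan(queries, arrivals, [1, 2, 3, 1, 4, 6, 7, 8, 5])
```

With these inputs the sketch gives 3, 7, and 7 sec (an average of 17/3, about 5.67 sec, which the slide shows as 5.6) and 3 queries in 9 sec, about .33 queries per second.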
Contention based scheduling (bias 0)
[Figure: timeline over buckets B1 to B8 with arrival (Arr) and End markers for Qi, Qj, and Qk]
Bucket reads and the sub-queries each serves: Qi1 (B1), Qi2 (B2), Qi3 Qj3 (B3), Qk5 (B5), Qj1 Qk1 Qj4 Qk4 (B1, B4), Qj6 Qk6 (B6), Qj7 Qk7 (B7), Qj8 Qk8 (B8)
Completion Times: Qi – 7 sec, Qj – 5 sec, Qk – 6 sec, Avg – 6 sec (vs. 5.6 with bias 1)
Tp – .38 qry/sec (vs. .33 with bias 1)
Throughput Performance
Tuning the age bias
Throughput performance gap grows while response time gap is insensitive to saturation
Increasing age bias is more attractive at low saturation
Parameter tuning using trade-off curves
Discussion
Impact of caching strategies
Workload overflow
– Large intermediate join results
– Migrate pairs of workload and bucket
Beyond completion order
– Higher priority for interactive queries
Batch processing in a clustered environment
P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.
WHAT ABOUT US?
Filter and refine
Partition data into buckets
Average Response Time
Outline
Motivation
– Goals for data-driven, batch scheduling
– Target application (SkyQuery)
LifeRaft scheduler
– Filter and refine queries
– Throughput maximizing metric
– Starvation resistance
– Differences in outcomes
Workload adaptive parameter selection