Improving Hash Join Performance through Prefetching
1
Improving Hash Join Performance through Prefetching

By SHIMIN CHEN, Intel Research Pittsburgh
ANASTASSIA AILAMAKI, Carnegie Mellon University
PHILLIP B. GIBBONS, Intel Research Pittsburgh
and TODD C. MOWRY, Carnegie Mellon University and Intel Research Pittsburgh

- Manisha Singh (2292697)
2
Outline
- Overview
- Proposed Techniques
- Experimental Setup
- Performance Evaluation
- Conclusion
3
Hash Joins
- Used in the implementation of a relational database management system.
- Two relations: build (small) and probe (large).
- Excessive random I/Os if the build relation and hash table cannot fit in memory.
[Figure: build relation and probe relation joined through a hash table]
4
Hash Join Performance
- Suffers from CPU cache stalls.
- Most of the execution time is wasted on data cache misses:
  - 82% for the partition phase, 73% for the join phase.
  - Caused by random access patterns in memory.
5
Solution: Cache Prefetching
- Cache prefetching has been successfully applied to several types of applications.
- Idea: exploit cache prefetching to improve hash join performance.
66
ChallengesChallenges to Cache to Cache PrefetchingPrefetching
____________________________________________________Difficult to obtain memory addresses earlyDifficult to obtain memory addresses early
– Randomness of hashing prohibits address prediction Randomness of hashing prohibits address prediction – Data dependencies within the processing of a tupleData dependencies within the processing of a tuple
Complexity of hash join codeComplexity of hash join code– Ambiguous pointer referencesAmbiguous pointer references– Multiple code pathsMultiple code paths– Cannot apply compiler prefetching techniquesCannot apply compiler prefetching techniques
7
Overcoming These Challenges
- Evaluate two new prefetching techniques:
  - Group prefetching: hide cache miss latency across a group of tuples.
  - Software-pipelined prefetching: avoid the intermittent stalls at group boundaries.
8
Group Prefetching
- Hides cache miss latency across a group of tuples:
  - Combine the processing of a group of tuples into a single loop body and rearrange the probe operations into stages.
  - Process all the tuples in the group for one stage, then move to the next stage.
  - Add prefetch instructions to the algorithm: issue prefetch instructions in one code stage for the memory references of the next code stage.
10
Software-Pipelined Prefetching
- Overlaps cache misses across different code stages of different tuples.
- The code stages of the same tuple are processed in subsequent iterations.
- Can overlap the cache miss latency of a tuple across all the processing in an iteration.
12
Group vs. Software-Pipelined Prefetching
- Hiding latency:
  - Software-pipelined prefetching is always able to hide all latencies.
- Book-keeping overhead:
  - Software-pipelined prefetching has more overhead.
- Code complexity:
  - Group prefetching is easier to implement.
  - The natural group boundary provides a place to do any leftover processing (e.g., for read-write conflicts).
  - It is also a natural place to send outputs to the parent operator if a pipelined operator is needed.
13
Experimental Setup
- Use a simple schema for both the build and probe relations.
- Every tuple contains a 4-byte join attribute and a fixed-length payload.
- Perform the join without selections and projections.
- Assume the join phase uses 50 MB of memory to join a pair of build and probe partitions.
14
Performance Evaluation
Hash join is CPU-bound with reasonable I/O bandwidth:
- The main total time is the elapsed real time of an algorithm phase.
- The worker I/O stall time is the largest I/O stall time among the individual worker threads.
15
Performance Evaluation (cont.)
User-Mode CPU Cache Performance: Join Phase
- The prefetching techniques achieved 3.02-4.04X speedups over the original hash join.
16
Performance Evaluation (cont.)
Join Performance with Varying Memory Latency
- The prefetching techniques remain effective even when the processor/memory speed gap increases dramatically.
18
Some Practical Issues
Some issues may arise when implementing these prefetching techniques in a production DBMS that targets multiple architectures and is distributed as binaries:
1. The syntax of prefetch instructions often differs across architectures and compilers.
2. Some architectures do not support faulting prefetches.
3. Several architectures (e.g., network processors) require software to explicitly manage the caches.
4. Pre-set parameters for the group size and the prefetch distance may be suboptimal on machines with very different configurations.
19
Conclusion
- Even though prefetching is a promising technique for improving CPU cache performance, applying it to the hash join algorithm is not straightforward (due to the dependencies within the processing of a single tuple and the randomness of hashing).
- Experimental results demonstrated that hash join performance can be improved by using the group prefetching and software-pipelined prefetching techniques.
- Several practical issues arise when the techniques are used in a DBMS that targets multiple architectures.