Improving Hash Join Performance through Prefetching
1
Improving Hash Join Performance through Prefetching

By SHIMIN CHEN, Intel Research Pittsburgh
ANASTASSIA AILAMAKI, Carnegie Mellon University
PHILLIP B. GIBBONS, Intel Research Pittsburgh
and TODD C. MOWRY, Carnegie Mellon University and Intel Research Pittsburgh

- Manisha Singh (2292697)
2
Outline
- Overview
- Proposed Techniques
- Experimental Setup
- Performance Evaluation
- Conclusion
3
Hash Joins
- Used in the implementation of a relational database management system.
- Two relations: build (small) and probe (large).
- Excessive random I/Os if the build relation and hash table cannot fit in memory.
[Figure: build relation and probe relation joined through a hash table]
4
Hash Join Performance
- Suffers from CPU cache stalls.
- Most of the execution time is wasted on data cache misses:
  - 82% for the partition phase, 73% for the join phase.
  - Caused by random access patterns in memory.
5
Solution: Cache Prefetching
- Cache prefetching has been successfully applied to several types of applications.
- Idea: exploit cache prefetching to improve hash join performance.
66
ChallengesChallenges to Cache to Cache PrefetchingPrefetching
____________________________________________________Difficult to obtain memory addresses earlyDifficult to obtain memory addresses early
– Randomness of hashing prohibits address prediction Randomness of hashing prohibits address prediction – Data dependencies within the processing of a tupleData dependencies within the processing of a tuple
Complexity of hash join codeComplexity of hash join code– Ambiguous pointer referencesAmbiguous pointer references– Multiple code pathsMultiple code paths– Cannot apply compiler prefetching techniquesCannot apply compiler prefetching techniques
7
Overcoming These Challenges
- Evaluate two new prefetching techniques:
  - Group prefetching: hide cache miss latency across a group of tuples.
  - Software-pipelined prefetching: avoid the intermittent stalls at group boundaries.
8
Group Prefetching
- Hides cache miss latency across a group of tuples:
  - Combine the processing of a group of tuples into a single loop body and rearrange the probe operations into stages.
  - Process all the tuples in the group for one stage, then move to the next stage.
  - Add prefetch instructions to the algorithm: issue prefetch instructions in one code stage for the memory references of the next code stage.
10
Software-Pipelined Prefetching
- Overlaps cache misses across different code stages of different tuples.
- The code stages of the same tuple are processed in subsequent iterations.
- Can overlap the cache miss latency of a tuple across all the processing in an iteration.
12
Group vs. Software-Pipelined Prefetching
- Hiding latency:
  - Software-pipelined prefetching is always able to hide all latencies.
- Book-keeping overhead:
  - Software-pipelined prefetching has more overhead.
- Code complexity:
  - Group prefetching is easier to implement.
  - The natural group boundary provides a place to do any leftover processing (e.g., for read-write conflicts).
  - It is also a natural place to send outputs to the parent operator if a pipelined operator is needed.
13
Experimental Setup
- Use a simple schema for both the build and probe relations.
- Every tuple contains a 4-byte join attribute and a fixed-length payload.
- Perform the join without selections and projections.
- Assume the join phase uses 50 MB of memory to join a pair of build and probe partitions.
14
Performance Evaluation
Hash join is CPU-bound with reasonable I/O bandwidth:
- The main total time is the elapsed real time of an algorithm phase.
- The worker I/O stall time is the largest I/O stall time among the individual worker threads.
15
Performance Evaluation (cont.)
User-Mode CPU Cache Performance: Join Phase
- The prefetching techniques achieved 3.02-4.04X speedups over the original hash join.
16
Performance Evaluation (cont.)
Join Performance with Varying Memory Latency
- The prefetching techniques remain effective even when the processor/memory speed gap increases dramatically.
18
Some Practical Issues
Some issues may arise when implementing these prefetching techniques in a production DBMS that targets multiple architectures and is distributed as binaries:
1. The syntax of prefetch instructions often differs across architectures and compilers.
2. Some architectures do not support faulting prefetches.
3. Several architectures (e.g., network processors) require software to explicitly manage the caches.
4. Pre-set parameters for the group size and the prefetch distance may be suboptimal on machines with very different configurations.
19
Conclusion
- Even though prefetching is a promising technique for improving CPU cache performance, applying it to the hash join algorithm is not straightforward (due to the dependencies within the processing of a single tuple and the randomness of hashing).
- Experimental results demonstrated that hash join performance can be improved by using the group prefetching and software-pipelined prefetching techniques.
- Several practical issues arise when the techniques are used in a DBMS that targets multiple architectures.