Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture
School of Computer Engineering
Nanyang Technological University
27th Aug 2013
Jiong He, Mian Lu, Bingsheng He
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
Importance of Hash Joins
• In-memory databases
  – Enable GBs or even TBs of data to reside in main memory (e.g., on large-memory commodity servers)
  – A hot research topic recently
• Hash joins
  – The most efficient join algorithm in main-memory databases
  – Focus: simple hash joins (SHJ, ICDE 2004) and partitioned hash joins (PHJ, VLDB 1999)
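As a point of reference, the simple hash join (SHJ) named above can be sketched in a few lines: build a hash table on one relation, then probe it with the other. This is a minimal illustrative sketch, not the paper's implementation; all names are placeholders.

```python
from collections import defaultdict

def simple_hash_join(r, s):
    """r, s: lists of (key, record_id) tuples. Returns matching rid pairs."""
    # Build phase: hash every R tuple into a bucket keyed by the join key.
    table = defaultdict(list)
    for key, rid in r:
        table[key].append(rid)
    # Probe phase: look up each S tuple; emit one result pair per match.
    return [(r_rid, s_rid)
            for key, s_rid in s
            for r_rid in table.get(key, [])]

R = [(1, "r1"), (2, "r2"), (2, "r3")]
S = [(2, "s1"), (3, "s2")]
print(simple_hash_join(R, S))  # [('r2', 's1'), ('r3', 's1')]
```

PHJ differs only in that both relations are first partitioned into cache-sized chunks, and a join like the above runs per partition.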
Hash Joins on New Architectures
• Emerging hardware
  – Multi-core CPUs (8-core, 16-core, even many-core)
  – Massively parallel GPUs (NVIDIA, AMD, Intel, etc.)
• Query co-processing on new hardware
  – On multi-core CPUs: SIGMOD’11 (S. Blanas), …
  – On GPUs: SIGMOD’08 (B. He), VLDB’09 (C. Kim), …
  – On Cell: ICDE’07 (K. Ross), …
Bottlenecks
• Conventional query co-processing is inefficient
  – Data transfer overhead via PCI-e
  – Imbalanced workload distribution
[Figure: the discrete architecture — a CPU (with its cache) attached to main memory and a GPU (with its cache) attached to device memory, connected via PCI-e. The CPU gets the light-weight workload: create context, send and receive data, launch the GPU program, post-processing. The GPU gets the heavy-weight workload: all real computations.]
The Coupled Architecture
[Figure: the coupled architecture — CPU and GPU on one chip, sharing the cache and main memory.]
• Coupled CPU-GPU architecture
  – Intel Sandy Bridge, AMD Fusion APU, etc.
• New opportunities
  – Remove the data transfer overhead
  – Enable fine-grained workload scheduling
  – Increase cache reuse
Challenges Come with Opportunities
• Efficient data sharing
  – Share main memory
  – Share the last-level cache (LLC)
• Keep both processors busy
  – The GPU cannot dominate the performance
  – Assign suitable tasks to each device for maximum speedup
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
Fine-Grained Definition of Steps for Co-Processing
• A hash join consists of three stages: partition, build, and probe
• Each stage consists of multiple steps (take build as an example)
  – b1: compute the hash bucket number
  – b2: access the hash bucket header
  – b3: search the key list
  – b4: insert the tuple
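The build stage broken into the four steps above can be sketched as separate functions, so that each step could in principle be assigned to either device. This is an illustrative sketch under an assumed bucket layout, not the paper's data structure.

```python
N_BUCKETS = 4  # illustrative; real systems size this to the cache

def b1_hash(key):                         # b1: compute the hash bucket number
    return key % N_BUCKETS

def b2_header(table, bucket):             # b2: access the hash bucket header
    return table.setdefault(bucket, [])

def b3_search(entries, key):              # b3: search the key list for the key
    for entry in entries:
        if entry[0] == key:
            return entry
    return None

def b4_insert(entries, entry, key, rid):  # b4: insert the tuple
    if entry is None:
        entries.append((key, [rid]))      # new key: start a rid list
    else:
        entry[1].append(rid)              # existing key: extend its rid list

def build(r):
    table = {}
    for key, rid in r:
        bucket = b1_hash(key)
        entries = b2_header(table, bucket)
        entry = b3_search(entries, key)
        b4_insert(entries, entry, key, rid)
    return table
```

Separating the steps like this is what makes the step-level co-processing schemes below possible: each step's input/output boundary is an explicit hand-off point.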
Co-Processing Mechanisms
• We study three co-processing mechanisms
  – Off-loading (OL)
  – Data-dividing (DD)
  – Pipeline (PL)
• With the fine-grained step definition of hash joins, we can easily implement algorithms with any of these mechanisms
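To make the contrast concrete, here is a hedged sketch of the data-dividing idea: split the input once by a ratio and give each device its slice. `run_on_cpu` and `run_on_gpu` are hypothetical stand-ins for real per-device kernels, not the paper's API; in a real system the two calls run concurrently.

```python
def data_divide(tuples, cpu_ratio, run_on_cpu, run_on_gpu):
    """Split the input by cpu_ratio; each device processes its slice."""
    cut = int(len(tuples) * cpu_ratio)
    # The two calls below would run concurrently on the two devices.
    return run_on_cpu(tuples[:cut]) + run_on_gpu(tuples[cut:])

out = data_divide(list(range(10)), 0.3,
                  run_on_cpu=lambda xs: [("cpu", x) for x in xs],
                  run_on_gpu=lambda xs: [("gpu", x) for x in xs])
print(out[:4])  # [('cpu', 0), ('cpu', 1), ('cpu', 2), ('gpu', 3)]
```

OL corresponds to `cpu_ratio` being 0 or 1 for a whole step; PL applies a split like this per step, with a different ratio for each.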
Off-loading (OL)
• Method: offload a whole step to one device
• Advantage: easy to schedule
• Disadvantage: imbalance
Data-dividing (DD)
• Method: partition the input at the stage level
• Advantage: easy to schedule, no imbalance
• Disadvantage: devices are underutilized
Pipeline (PL)
• Method: partition the input at the step level
• Advantage: balanced; devices are fully utilized
• Disadvantage: hard to schedule
Determining Suitable Ratios for PL is Challenging
• The workload preferences of the CPU and the GPU vary
• The computation type and the amount of memory access differ across steps
• Delays across steps should be minimized to achieve a global optimum
Cost Model
• An abstract model for the CPU/GPU
• Estimates data transfer costs, memory access costs, and execution costs
• With the cost model, we can
  – Estimate the elapsed time
  – Choose the optimal workload ratios
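The ratio-selection idea can be illustrated with a toy version of the model: given per-tuple unit costs on each device, a step finishes earliest when both devices finish together. The closed-form ratio below is standard load balancing, shown only to convey the intuition; the paper's full model also accounts for memory access and data transfer costs.

```python
def optimal_cpu_ratio(c_cpu_ns, c_gpu_ns):
    """Give the CPU a share inversely proportional to its unit cost."""
    return c_gpu_ns / (c_cpu_ns + c_gpu_ns)

def step_time(n_tuples, ratio, c_cpu_ns, c_gpu_ns):
    """Both devices run concurrently; the slower one sets the elapsed time."""
    return max(n_tuples * ratio * c_cpu_ns,
               n_tuples * (1 - ratio) * c_gpu_ns)

# Toy unit costs (assumptions for illustration): CPU 6 ns/tuple, GPU 2 ns/tuple.
r = optimal_cpu_ratio(c_cpu_ns=6.0, c_gpu_ns=2.0)  # CPU gets 25% of the tuples
print(round(r, 2), step_time(1000, r, 6.0, 2.0))   # 0.25 1500.0
```

With any other ratio, one device idles while the other still works, so the elapsed time can only grow.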
More details can be found in our paper.
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
System Setup
• System configurations
• Data sets
  – R and S relations with 16M tuples each
  – Two attributes in each tuple: (key, record-ID)
  – Data skew: uniform, low skew, and high skew
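A workload of this shape can be generated as follows. This is an illustrative sketch: the sizes, key space, and the Zipf-like skew parameter are assumptions for demonstration (the experiments use 16M tuples per relation).

```python
import random

def make_relation(n, key_space, skew=0.0, seed=42):
    """Build a list of (key, record-ID) tuples, uniform or skewed."""
    rng = random.Random(seed)
    if skew == 0.0:
        keys = [rng.randrange(key_space) for _ in range(n)]
    else:
        # Simple Zipf-like draw: weight key k by 1 / (k + 1)^skew.
        weights = [1.0 / (k + 1) ** skew for k in range(key_space)]
        keys = rng.choices(range(key_space), weights=weights, k=n)
    return [(k, i) for i, k in enumerate(keys)]  # record-ID = position

R = make_relation(1000, key_space=100)             # uniform
S = make_relation(1000, key_space=100, skew=1.5)   # high skew
```

The fixed seed keeps runs reproducible, which matters when comparing co-processing variants on the same input.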
          # cores   Core frequency (GHz)   Zero copy buffer (MB)   Local memory (KB)   Cache (MB)
  CPU     4         3.0                    512                     –                   4
  GPU     400       0.6                    –                       32                  –
Discrete vs. Coupled Architecture
• In the discrete architecture:
  – data transfer takes 4%–10% of the total time
  – merge takes 14%–18%
• The coupled architecture outperforms the discrete one by 5%–21% across all variants
[Figure: elapsed time (s) of SHJ-DD, SHJ-OL, PHJ-DD, and PHJ-OL on the discrete vs. coupled architecture, broken down into data transfer, merge, partition, build, and probe. Improvement of coupled over discrete: 15.3% (SHJ-DD), 5.1% (SHJ-OL), 21.5% (PHJ-DD), 6.2% (PHJ-OL).]
Fine-grained vs. Coarse-grained
• For SHJ, PL outperforms OL and DD by 38% and 27%, respectively
• For PHJ, PL outperforms OL and DD by 39% and 23%, respectively
[Figure: elapsed time (s) of OL (GPU-only), DD, and PL (fine-grained) for SHJ and PHJ, annotated with PL's improvements: 38% over OL and 27% over DD for SHJ; 39% over OL and 23% over DD for PHJ.]
Unit Costs in Different Steps
• The unit cost is the average processing time of one tuple on one device in one step
• Costs vary heavily across steps and between the two devices
[Figure: elapsed time per tuple (ns) on the CPU and the GPU for each step — pr1–pr3 (partition), b1–b4 (build), p1–p4 (probe).]
Ratios Derived from Cost Model
• Ratios differ across steps
  – In the first step of all three stages (i.e., hashing), the GPU should take most of the work
• Workload division is fine-grained at the step level
Other Findings
• Results on skewed data
• Results on inputs of varying sizes
• Evaluations of some design tradeoffs, etc.
More details can be found in our paper.
Outline
• Motivations
• System Design
• Evaluations
• Conclusions
Conclusions
• Implemented hash joins on the discrete and the coupled CPU-GPU architectures
• Proposed a generic cost model to guide fine-grained tuning toward optimal performance
• Evaluated design tradeoffs so that hash joins better exploit the hardware
• The first systematic study of hash join co-processing on the emerging coupled CPU-GPU architecture
Future Work
• Design a full-fledged query processor
• Extend the fine-grained design methodology to other applications on the coupled CPU-GPU architecture
Acknowledgement
• We thank Dr. Qiong Luo and Ong Zhong Liang for their valuable comments
• This work is partly supported by a MoE AcRF Tier 2 grant (MOE2012-T2-2-067) in Singapore and an Interdisciplinary Strategic Competitive Fund of Nanyang Technological University 2011 for “C3: Cloud-Assisted Green Computing at NTU Campus”
Questions?