
Transcript of S5146: Data Movement Options for Scalable GPU Cluster Communication

Page 1:

S5146: Data Movement Options for Scalable GPU Cluster Communication

Benjamin Klenk, PhD Student

Institute of Computer Engineering, Ruprecht-Karls University of Heidelberg, Germany

http://www.ziti.uni-heidelberg.de/ziti/en/ce-home

GTC 2015, San Jose, CA, US, 03/19/2015

Page 2:

CUDA Programming Model

▪ GPU Computing & CUDA • Thread hierarchy, shared memory, barrier • SIMT – Single Instruction, Multiple Threads

▪ Collaborative computing • Partitioning, divergence • Synchronization

▪ Collaborative memory accesses • Slackness to avoid large caching structures • Strong need for coalescing • Caching to reduce traffic on memory bus

➔ What about communication?
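Before turning to communication, here is a minimal CUDA sketch (my illustration, not code from the talk) that touches all three points above: the thread hierarchy, shared memory with a barrier, and coalesced global loads.

// Block-wise sum: thread hierarchy, shared memory, barrier, coalesced loads.
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];                        // on-chip shared memory per block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    buf[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;    // coalesced: consecutive threads read consecutive addresses
    __syncthreads();                                  // barrier across the thread block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}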

[Figure: GPU execution model — Compute, Memory Access (MC), Output data set]
Page 3:

A single GPU is barely enough…

▪ HPC demands virtually unlimited computational power
▪ Workloads don't fit in memory
  • Graph computation
  • Deep learning
  • Molecular dynamics, astrophysics
▪ Deploy several GPUs
  • More FLOP/s
  • More GBs

➔ But: communication
➔ CUDA isn't enough

[Figure: cluster node architecture — GPU: 1.4 TFLOP/s, 12 GB GDDR5 at 288 GB/s; CPU: 0.13 TFLOP/s, 64 GB DDR3 at 60 GB/s; PCIe links at 16 GB/s between GPU, CPU, and NIC; NIC into the network fabric at 12 GB/s]
Page 4:

What am I going to talk about?

✦ What does communication currently look like?
✦ Problems with current models
✦ Introducing a global address space for GPUs
✦ Performance and energy measurements


Page 5:

Review: Messaging-based Communication


▪ MPI as the de-facto standard
▪ CPU controls communication

▪ Put/Get
  • Memory registration
  • OS & driver interactions

▪ Work request generation
▪ Notification handling
  • Where to put them?
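To make the CPU-controlled flow concrete, here is a minimal sketch of the conventional staging path (my illustration, not code from the talk; function and buffer names are placeholders):

#include <mpi.h>
#include <cuda_runtime.h>

// Sender side of the CPU-controlled path: device-to-host copy,
// then an MPI transfer driven by the CPU.
void send_from_gpu(const double *d_buf, double *h_buf, int n, int peer) {
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
}

// Receiver side mirrors it: MPI_Recv into host memory, then host-to-device copy.
void recv_to_gpu(double *d_buf, double *h_buf, int n, int peer) {
    MPI_Recv(h_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
}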

Page 6:

Review: One-sided Communication

▪ MPI as the de-facto standard
▪ CPU controls communication

▪ Put/Get
  • Memory registration
  • OS & driver interactions

▪ Work request generation
▪ Notification handling
  • Where to put them?

[Figure: put/get timeline across GPU, CPU, and NIC (PCIe) on both sides of the network — the CPU issues the work request, the NIC reads data from the source GPU's memory, the network packet is sent, data is written to the destination GPU's memory, and completion notifications finish the transfer on both sides; the legend distinguishes CUDA stack, MPI stack, computation, and possible overlap]

B. Klenk, L. Oden, and H. Fröning, "Analyzing Put/Get APIs for Thread-Collaborative Processors," HUCAA Workshop in conjunction with ICPP, Minneapolis, MN, USA, 2014.

Page 7:

The Problem in Numbers

▪ IB Verbs over QDR InfiniBand: CPU-controlled vs. GPU-controlled
▪ GPUs are incompatible with messaging
  • Generating work requests
  • Registering memory
  • Polling on notifications
  • Controlling networking devices
▪ Bandwidth ~100x lower when the GPU controls the NIC
▪ Kernel launch time equals a 32 kB data movement (see the estimate below)
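As a rough sanity check on that last point (my numbers, not from the slide): assuming an effective QDR bandwidth of about 4 GB/s and a kernel launch overhead of roughly 8 µs,

  t_transfer = 32 kB / 4 GB/s ≈ 8 µs ≈ t_launch

so transfers below roughly 32 kB are dominated by launch overhead rather than by the wire.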


See also: L. Oden, H. Fröning, F.-J. Pfreundt, "InfiniBand Verbs on GPU: A Case Study of Controlling an InfiniBand Network Device from the GPU," ASHES Workshop at IPDPS 2014, to be published.

Page 8:

GGAS – Global GPU Address Spaces

Reminder: everything in CUDA is thread-collaborative


Page 9:

Let's get back to collaborative work

▪ GAS (global address space) across GPUs
  • Address translation / target identification
  • Special hardware support required (NIC: EXTOLL)

▪ Severe limitations for full coherence and strong consistency

▪ Reverting to highly relaxed consistency models


Lena Oden and Holger Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.

Page 10:

EXTOLL

▪ HPC interconnection technology
▪ FPGA-based (Xilinx Virtex-6)
  • 157 MHz @ 64-bit datapaths
  • PCIe 2.0
  • 4 ports @ 16 Gb/s per direction

▪ ASIC in production
  • PCIe 3.0 (+ root port)
  • 6 + 1 ports @ 120 Gb/s per direction

▪ MPI, low-level API, open source
▪ SMFU (Shared Memory Functional Unit): supports GGAS

Holger Fröning and Heiner Litz, "Efficient Hardware Support for the Partitioned Global Address Space," 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with IPDPS 2010, April 19, 2010, Atlanta, Georgia.

www.extoll.de  


Page 11:

GGAS – thread-collaborative BSP-like communication

[Figure: BSP-like execution on two GPUs — computation, remote stores into the peer GPU's GDDR5, global barrier, continue with computation]

// Remote stores through the global GPU address space:
double *remote = (double *) get_ptr_of(node);  // pointer into the target node's GPU memory
remote[tid] = data[tid];                       // plain store; the NIC forwards it over the network

do_work();       // local computation phase

ggas_barrier();  // global barrier across all participating GPUs
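Note the design choice here: communication is just a memory store executed by every thread, the same way a store to local GDDR5 works. No work requests, memory registration, or notification handling sit on the critical path, which is why GGAS stays in line with the thread-collaborative CUDA model.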

Page 12:

GGAS – Bandwidth comparison


▪ MPI
  • CPU-controlled
  • cudaMemcpy D2H + MPI_Send
  • MPI_Recv + cudaMemcpy H2D
▪ GGAS
  • GPU-controlled, GDDR to GDDR
▪ RMA Direct
  • GPU-controlled, GDDR to GDDR
▪ RMA Host
  • CPU-controlled
  • cudaMemcpy D2H + RMA_Put
  • Get notification + cudaMemcpy H2D

Latency: GGAS ≈ 2 µs, RMA ≈ 5 µs, MPI ≈ 10 µs
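For comparison with the GGAS snippet on page 11, the RMA Host path from the list above could be sketched roughly as follows. This is only an outline; rma_put, rma_wait_notification, and remote_addr are hypothetical placeholders, not the real EXTOLL RMA API:

// CPU-controlled one-sided put, staged through host memory.
// All rma_* names below are hypothetical placeholders.
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // D2H staging copy
rma_put(h_buf, remote_addr, bytes);                        // CPU issues the one-sided put
// Receiver: wait for the completion notification, then copy into GPU memory:
//   rma_wait_notification();
//   cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);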

Page 13:

Analyzing Communication Models for Thread-parallel Processors in Terms of Energy and Time

How does GGAS compete with other methods?


Page 14:

Methodology

▪ High-performance computing: time to solution is the primary metric
▪ But: energy is becoming a dominating factor

▪ We measure time, but we also want to consider energy
▪ Power consumption needs to be determined as well (a measurement sketch follows below)
  • CPU, DRAM: Intel RAPL
  • GPU: NVIDIA NVML

▪ How do applications perform with regard to both performance and energy?
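For the GPU side, here is a minimal sketch of how a sampled power trace can be turned into energy (an illustration under stated assumptions: NVML headers available, binary linked with -lnvidia-ml; sampling interval and duration are arbitrary):

#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    nvmlInit();                                  // initialize the NVML library
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);         // first GPU in the system

    double energy_j = 0.0;
    const double dt = 0.01;                      // 10 ms sampling interval
    for (int i = 0; i < 1000; i++) {             // ~10 s window; the workload runs meanwhile
        unsigned int mw;
        nvmlDeviceGetPowerUsage(dev, &mw);       // instantaneous board power in milliwatts
        energy_j += (mw / 1000.0) * dt;          // accumulate energy: E = sum of P * dt
        usleep(10000);
    }
    printf("GPU energy: %.1f J\n", energy_j);
    nvmlShutdown();
    return 0;
}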


Page 15:

Allreduce – Power and Energy Analysis


Lena Oden, Benjamin Klenk and Holger Fröning, "Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs," 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 26-29, 2014, Chicago, IL, US.

[Figure: accumulated energy consumption over time, GGAS vs. MPI]

Page 16:

Workload analysis / application performance

▪ 12 nodes (each: 2x Intel Ivy Bridge, NVIDIA K20, EXTOLL FPGA)
▪ Normalized to MPI (higher than 1 → better performance, lower than 1 → worse performance)


Benjamin Klenk, Lena Oden and Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

Page 17:

Energy analysis

▪ Same cluster (12 nodes)
▪ Normalized to MPI
  • lower than 1 → less energy
  • higher than 1 → more energy

▪ GGAS: 25% less energy
▪ RMA: 20% less energy

▪ Why?
  • Less power: the CPU can sleep
  • Less execution time


Benjamin Klenk, Lena Oden and Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

Page 18:

Conclusions

What have we learned?


Page 19:

Summary

▪ CPU-controlled communication (MPI, incl. MVAPICH & CUDA-aware MPI)
  • State of the art
  • Context/domain switches → additional overhead due to kernel start latency
  • Additional data copies → GPUDirect RDMA is effective only for small messages
  • Programming complexity increases → CUDA+MPI+X, X ∈ {OpenMP, pthreads, ...}

▪ GPU-controlled communication
  • Currently needs specialized hardware (e.g. EXTOLL)
  • Promising performance
  • Power consumption can be reduced by putting the CPU into sleep mode
  • In line with the CUDA programming model


Page 20:

Conclusion

▪ Post-Dennard scaling
▪ The communication/computation gap will increase dramatically in the future

➔ Heterogeneity in communication

▪ Abstractions and adaptivity minimize complexity
  • Hardware optimizations and software libraries to support efficient communication
  • Adaptive task models support dynamic application behavior
  • Hide architectural complexity


Specialized processors like GPUs require specialized communication models.

Page 21:

Synergies: High Octane Project

▪ Communication-centric cluster
▪ 8 nodes
  • 16 Intel Ivy Bridge CPUs
  • 16 NVIDIA K20 GPUs (currently 8)
  • 16 EXTOLL NICs (currently 8)

▪ Put/Get, MPI & GGAS support
▪ Open to other researchers, with various possible interactions
  • System-level software
  • Workloads from HPC and other domains
  • Compilers
  • Optimizations

[Figure: High Octane node — two CPUs with DDR3, each paired with a GPU (GDDR5) and an EXTOLL NIC]

Page 22:

Credits


Thank you!

Contributions: Lena Oden (PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student)

Discussions: Sudha Yalamanchili (Georgia Tech), Mark Hummel (NVIDIA)
Sponsoring: NVIDIA, Xilinx, German Excellence Initiative, Google
EXTOLL: Ulrich Brüning, Mondrian Nüssle and the complete team


http://www.ziti.uni-heidelberg.de/ziti/en/ce-home