
Transcript of S5146: Data Movement Options for Scalable GPU Cluster Communication

Page 1:

S5146: Data Movement Options for Scalable GPU Cluster Communication

Benjamin Klenk, PhD Student

Institute of Computer Engineering, Ruprecht-Karls University of Heidelberg, Germany

http://www.ziti.uni-heidelberg.de/ziti/en/ce-home

GTC 2015, San Jose, CA, US, 03/19/2015

Page 2:

CUDA Programming Model

▪ GPU Computing & CUDA • Thread hierarchy, shared memory, barrier • SIMT – Single Instruction, Multiple Threads

▪ Collaborative computing • Partitioning, divergence • Synchronization

▪ Collaborative memory accesses • Slackness to avoid large caching structures • Strong need for coalescing • Caching to reduce traffic on memory bus

➔ What about communication?
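Before turning to communication, here is a minimal CUDA sketch (my illustration, not code from the talk) that touches all three points above: the thread hierarchy, shared memory with a barrier, and coalesced global loads.

// Block-wise sum: thread hierarchy, shared memory, barrier, coalesced loads.
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];                        // on-chip shared memory per block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    buf[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;    // coalesced: consecutive threads read consecutive addresses
    __syncthreads();                                  // barrier across the thread block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];   // one partial sum per block
}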

[Figure: GPU execution model — Compute, Memory Access (MC), Output data set]
Page 3:

A single GPU is barely enough…

▪ HPC demands virtually unlimited computational power
▪ Workloads don't fit in memory
  • Graph computation
  • Deep learning
  • Molecular dynamics, astrophysics
▪ Deploy several GPUs
  • More FLOP/s
  • More GBs

➔ But: communication
➔ CUDA isn't enough

[Figure: cluster node architecture — GPU: 1.4 TFLOP/s, 12 GB GDDR5 at 288 GB/s; CPU: 0.13 TFLOP/s, 64 GB DDR3 at 60 GB/s; PCIe links at 16 GB/s between GPU, CPU, and NIC; NIC into the network fabric at 12 GB/s]
Page 4:

What am I going to talk about?

✦ What does communication currently look like?
✦ Problems with current models
✦ Introducing a global address space for GPUs
✦ Performance and energy measurements


Page 5:

Review: Messaging-based Communication


▪ MPI as the de-facto standard
▪ CPU controls communication

▪ Put/Get
  • Memory registration
  • OS & driver interactions

▪ Work request generation
▪ Notification handling
  • Where to put them?
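To make the CPU-controlled flow concrete, here is a minimal sketch of the conventional staging path (my illustration, not code from the talk; function and buffer names are placeholders):

#include <mpi.h>
#include <cuda_runtime.h>

// Sender side of the CPU-controlled path: device-to-host copy,
// then an MPI transfer driven by the CPU.
void send_from_gpu(const double *d_buf, double *h_buf, int n, int peer) {
    cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
}

// Receiver side mirrors it: MPI_Recv into host memory, then host-to-device copy.
void recv_to_gpu(double *d_buf, double *h_buf, int n, int peer) {
    MPI_Recv(h_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
}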

Page 6:

Review: One-sided Communication

▪ MPI as the de-facto standard
▪ CPU controls communication

▪ Put/Get
  • Memory registration
  • OS & driver interactions

▪ Work request generation
▪ Notification handling
  • Where to put them?

[Figure: put/get timeline across GPU, CPU, and NIC (PCIe) on both sides of the network — the CPU issues the work request, the NIC reads data from the source GPU's memory, the network packet is sent, data is written to the destination GPU's memory, and completion notifications finish the transfer on both sides; the legend distinguishes CUDA stack, MPI stack, computation, and possible overlap]

B. Klenk, L. Oden, and H. Fröning, "Analyzing Put/Get APIs for Thread-Collaborative Processors," HUCAA Workshop in conjunction with ICPP, Minneapolis, MN, USA, 2014.

Page 7:

The Problem in Numbers

▪ IB Verbs over QDR InfiniBand: CPU-controlled vs. GPU-controlled
▪ GPUs are incompatible with messaging
  • Generating work requests
  • Registering memory
  • Polling on notifications
  • Controlling networking devices
▪ Bandwidth ~100x lower when the GPU controls the NIC
▪ Kernel launch time equals a 32 kB data movement (see the estimate below)
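As a rough sanity check on that last point (my numbers, not from the slide): assuming an effective QDR bandwidth of about 4 GB/s and a kernel launch overhead of roughly 8 µs,

  t_transfer = 32 kB / 4 GB/s ≈ 8 µs ≈ t_launch

so transfers below roughly 32 kB are dominated by launch overhead rather than by the wire.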


See also: L. Oden, H. Fröning, F.-J. Pfreundt, "InfiniBand Verbs on GPU: A Case Study of Controlling an InfiniBand Network Device from the GPU," ASHES Workshop at IPDPS 2014, to be published.

Page 8:

GGAS – Global GPU Address Spaces

Reminder: everything in CUDA is thread-collaborative


Page 9:

Let's get back to collaborative work

▪ GAS (global address space) across GPUs
  • Address translation / target identification
  • Special hardware support required (NIC: EXTOLL)

▪ Severe limitations for full coherence and strong consistency

▪ Reverting to highly relaxed consistency models


Lena Oden and Holger Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.

Page 10:

EXTOLL

▪ HPC interconnection technology
▪ FPGA-based (Xilinx Virtex-6)
  • 157 MHz @ 64-bit datapaths
  • PCIe 2.0
  • 4 ports @ 16 Gb/s per direction

▪ ASIC in production
  • PCIe 3.0 (+ root port)
  • 6 + 1 ports @ 120 Gb/s per direction

▪ MPI, low-level API, open source
▪ SMFU (Shared Memory Functional Unit): supports GGAS

Holger Fröning and Heiner Litz, "Efficient Hardware Support for the Partitioned Global Address Space," 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with IPDPS 2010, April 19, 2010, Atlanta, Georgia.

www.extoll.de  


Page 11:

GGAS – thread-collaborative BSP-like communication

[Figure: BSP-like execution on two GPUs — computation, remote stores into the peer GPU's GDDR5, global barrier, continue with computation]

// Remote stores through the global GPU address space:
double *remote = (double *) get_ptr_of(node);  // pointer into the target node's GPU memory
remote[tid] = data[tid];                       // plain store; the NIC forwards it over the network

do_work();       // local computation phase

ggas_barrier();  // global barrier across all participating GPUs
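Note the design choice here: communication is just a memory store executed by every thread, the same way a store to local GDDR5 works. No work requests, memory registration, or notification handling sit on the critical path, which is why GGAS stays in line with the thread-collaborative CUDA model.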

Page 12:

GGAS – Bandwidth comparison


▪ MPI
  • CPU-controlled
  • cudaMemcpy D2H + MPI_Send
  • MPI_Recv + cudaMemcpy H2D
▪ GGAS
  • GPU-controlled, GDDR to GDDR
▪ RMA Direct
  • GPU-controlled, GDDR to GDDR
▪ RMA Host
  • CPU-controlled
  • cudaMemcpy D2H + RMA_Put
  • Get notification + cudaMemcpy H2D

Latency: GGAS ≈ 2 µs, RMA ≈ 5 µs, MPI ≈ 10 µs
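For comparison with the GGAS snippet on page 11, the RMA Host path from the list above could be sketched roughly as follows. This is only an outline; rma_put, rma_wait_notification, and remote_addr are hypothetical placeholders, not the real EXTOLL RMA API:

// CPU-controlled one-sided put, staged through host memory.
// All rma_* names below are hypothetical placeholders.
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // D2H staging copy
rma_put(h_buf, remote_addr, bytes);                        // CPU issues the one-sided put
// Receiver: wait for the completion notification, then copy into GPU memory:
//   rma_wait_notification();
//   cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);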

Page 13:

Analyzing Communication Models for Thread-parallel Processors in Terms of Energy and Time

How does GGAS compete with other methods?


Page 14:

Methodology

▪ High-performance computing: time to solution is the primary metric
▪ But: energy is becoming a dominating factor

▪ We measure time, but we also want to consider energy
▪ Power consumption needs to be determined as well (a measurement sketch follows below)
  • CPU, DRAM: Intel RAPL
  • GPU: NVIDIA NVML

▪ How do applications perform with regard to both performance and energy?
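For the GPU side, here is a minimal sketch of how a sampled power trace can be turned into energy (an illustration under stated assumptions: NVML headers available, binary linked with -lnvidia-ml; sampling interval and duration are arbitrary):

#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    nvmlInit();                                  // initialize the NVML library
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);         // first GPU in the system

    double energy_j = 0.0;
    const double dt = 0.01;                      // 10 ms sampling interval
    for (int i = 0; i < 1000; i++) {             // ~10 s window; the workload runs meanwhile
        unsigned int mw;
        nvmlDeviceGetPowerUsage(dev, &mw);       // instantaneous board power in milliwatts
        energy_j += (mw / 1000.0) * dt;          // accumulate energy: E = sum of P * dt
        usleep(10000);
    }
    printf("GPU energy: %.1f J\n", energy_j);
    nvmlShutdown();
    return 0;
}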


Page 15:

Allreduce – Power and Energy Analysis


Lena Oden, Benjamin Klenk and Holger Fröning, "Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs," 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 26-29, 2014, Chicago, IL, US.

[Figure: accumulated energy consumption over time, GGAS vs. MPI]

Page 16:

Workload analysis / application performance

▪ 12 nodes (each: 2x Intel Ivy Bridge, NVIDIA K20, EXTOLL FPGA)
▪ Normalized to MPI (higher than 1 → better performance, lower than 1 → worse performance)


Benjamin Klenk, Lena Oden and Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

Page 17:

Energy analysis

▪ Same cluster (12 nodes)
▪ Normalized to MPI
  • lower than 1 → less energy
  • higher than 1 → more energy

▪ GGAS: 25% less energy
▪ RMA: 20% less energy

▪ Why?
  • Less power: the CPU can sleep
  • Less execution time


Benjamin Klenk, Lena Oden and Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

Page 18:

Conclusions

What have we learned?


Page 19:

Summary

▪ CPU-controlled communication (MPI, incl. MVAPICH & CUDA-aware MPI)
  • State of the art
  • Context/domain switches → additional overhead due to kernel start latency
  • Additional data copies → GPUDirect RDMA is effective only for small messages
  • Programming complexity increases → CUDA+MPI+X, X ∈ {OpenMP, pthreads, ...}

▪ GPU-controlled communication
  • Currently needs specialized hardware (e.g. EXTOLL)
  • Promising performance
  • Power consumption can be reduced by putting the CPU into sleep mode
  • In line with the CUDA programming model


Page 20:

Conclusion

▪ Post-Dennard scaling
▪ The communication/computation gap will increase dramatically in the future

➔ Heterogeneity in communication

▪ Abstractions and adaptivity minimize complexity
  • Hardware optimizations and software libraries to support efficient communication
  • Adaptive task models support dynamic application behavior
  • Hide architectural complexity


Specialized processors like GPUs require specialized communication models.

Page 21:

Synergies: High Octane Project

▪ Communication-centric cluster
▪ 8 nodes
  • 16 Intel Ivy Bridge CPUs
  • 16 NVIDIA K20 GPUs (currently 8)
  • 16 EXTOLL NICs (currently 8)

▪ Put/Get, MPI & GGAS support
▪ Open to other researchers, with various possible interactions
  • System-level software
  • Workloads from HPC and other domains
  • Compilers
  • Optimizations

[Figure: High Octane node — two CPUs with DDR3, each paired with a GPU (GDDR5) and an EXTOLL NIC]

Page 22:

Credits


Thank you!

Contributions: Lena Oden (PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student)

Discussions: Sudha Yalamanchili (Georgia Tech), Mark Hummel (NVIDIA)
Sponsoring: NVIDIA, Xilinx, German Excellence Initiative, Google
EXTOLL: Ulrich Brüning, Mondrian Nüssle and the complete team


http://www.ziti.uni-heidelberg.de/ziti/en/ce-home