
rCUDA: towards energy-efficiency in GPU computing by leveraging low-power processors

and InfiniBand interconnects

Federico Silla, Technical University of Valencia, Spain

Joint research effort


HPC Advisory Council Spain Conference, Barcelona 2013 2/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 3/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 4/83

GPU computing encompasses all the technological issues involved in using the computational power of GPUs to execute general-purpose applications

GPU computing has experienced remarkable growth in recent years

Current GPU computing facilities

[Chart: number of systems in the Top500 list (June 2013) using accelerators: Clearspeed, IBM PowerCell, NVIDIA, ATI/AMD, Intel MIC/Phi]


HPC Advisory Council Spain Conference, Barcelona 2013 5/83

[Diagram: a node with a CPU, main memory, and a network interface, plus several GPUs with their own GPU memory, all attached through PCIe]

The basic building block is a node with one or more GPUs

Current GPU computing facilities


HPC Advisory Council Spain Conference, Barcelona 2013 6/83

From the programming point of view, a set of nodes, each one with:

one or more CPUs (probably with several cores per CPU)
one or more GPUs (typically between 1 and 4)
disjoint memory spaces for each GPU and the CPUs

An interconnection network

Current GPU computing facilities


HPC Advisory Council Spain Conference, Barcelona 2013 7/83

For some kinds of code, the use of GPUs brings huge benefits in terms of performance and energy

There must be data parallelism in the code: this is the only way to benefit from the hundreds of cores within a GPU (see the kernel sketch at the end of this slide)

Different scenarios from the point of view of the application:

Low level of data parallelism

High level of data parallelism

Moderate level of data parallelism

Applications for multi-GPU computing: leveraging up to the 4 GPUs inside the node

Current GPU computing facilities
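To make the data-parallelism requirement concrete, here is a minimal CUDA kernel, written for this text as an illustration rather than taken from the talk, in which each GPU thread processes one vector element; this one-thread-per-element pattern is what keeps the hundreds of GPU cores busy:

#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: the available parallelism grows with the data size.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // (filling x and y with input data is omitted in this sketch)
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);   // launch roughly one million threads
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}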


HPC Advisory Council Spain Conference, Barcelona 2013 8/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 9/83

A GPU computing facility is usually a set of independent, self-contained nodes that follow the shared-nothing approach:

Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster)

GPUs can only be used within the node they are attached to

How GPU virtualization adds value

[Diagram: a cluster of such nodes, each with CPU, main memory, network interface, and GPUs, linked by an interconnection network]


HPC Advisory Council Spain Conference, Barcelona 2013 10/83

CONCERNS:

Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle)

For applications with moderate levels of data parallelism, many GPUs in the cluster may be idle for long periods of time

Current trend in GPU deployment

How GPU virtualization adds value


HPC Advisory Council Spain Conference, Barcelona 2013 11/83


A way of addressing the first concern is to share the GPUs present in the cluster among all the nodes. Regarding the second concern, once GPUs are shared, their number can be reduced

This would increase GPU utilization and lower power consumption, while also reducing initial acquisition costs

This new configuration requires:

1) A way of seamlessly sharing GPUs across nodes in the cluster (GPU virtualization)

2) Enhanced job schedulers that take into account the new GPU configuration

How GPU virtualization adds value


HPC Advisory Council Spain Conference, Barcelona 2013 12/83

In summary, GPU virtualization enables a new vision of GPU deployment, moving from the usual cluster configuration:

How GPU virtualization adds value

[Diagram: the usual cluster configuration, where every node has its own GPUs attached]

to the following one ….


HPC Advisory Council Spain Conference, Barcelona 2013 13/83

How GPU virtualization adds value

[Diagram: physical configuration of the cluster vs. logical configuration, where GPUs are decoupled from the nodes and reached through logical connections over the interconnection network]


HPC Advisory Council Spain Conference, Barcelona 2013 14/83

How GPU virtualization adds value

GPU virtualization is also useful for multi-GPU applications:

Without GPU virtualization, only the GPUs in the node can be provided to the application.

With GPU virtualization, many GPUs in the cluster can be provided to the application.

[Diagram: the same cluster without and with GPU virtualization]


HPC Advisory Council Spain Conference, Barcelona 2013 15/83

How GPU virtualization adds value

Physical configuration: real local GPUs.
Logical configuration: virtualized remote GPUs.

[Diagram: nodes with real local GPUs vs. nodes accessing virtualized remote GPUs through the interconnection network]

GPU virtualization allows all nodes to access all GPUs


HPC Advisory Council Spain Conference, Barcelona 2013 16/83


One step further: enhance the scheduling process so that GPU servers are put into low-power sleep modes as soon as their acceleration features are not required

How GPU virtualization adds value


HPC Advisory Council Spain Conference, Barcelona 2013 17/83

Going even further: consolidating GPUs into dedicated GPU boxes (no CPU power required) and allowing GPU task migration?

TRUE GREEN GPU COMPUTING

How GPU virtualization adds value


HPC Advisory Council Spain Conference, Barcelona 2013 18/83

[Chart: influence of data transfers for SGEMM; percentage of time devoted to data transfers vs. matrix size, for pinned and non-pinned memory]

The main drawback of GPU virtualization is the increased latency and reduced bandwidth to the remote GPU

How GPU virtualization adds value

Data from a matrix-matrix multiplication using a local GPU!!!


HPC Advisory Council Spain Conference, Barcelona 2013 19/83

Several efforts have been made regarding GPU virtualization in recent years:

rCUDA (CUDA 5.0)
GVirtuS (CUDA 3.2)
DS-CUDA (CUDA 4.1)
vCUDA (CUDA 1.1)
GViM (CUDA 1.1)
GridCUDA (CUDA 2.3)
V-GPU (CUDA 4.0)

Some of these frameworks are publicly available; the others are not.

How GPU virtualization adds value


HPC Advisory Council Spain Conference, Barcelona 2013 20/83

Performance comparison of GPU virtualization solutions:

Intel Xeon E5-2620 (6 cores) 2.0 GHz (Sandy Bridge architecture)
NVIDIA Tesla K20 GPU
Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
SX6025 Mellanox switch
CentOS 6.3 + Mellanox OFED 1.5.3

How GPU virtualization adds value

Latency (measured by transferring 64 bytes):

          Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
CUDA           34.3          4.3          16.2           5.2
rCUDA          94.5         23.1         292.2           6.0
GVirtuS       184.2        200.3         168.4         182.8
DS-CUDA        45.9           -           26.5            -
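As a rough illustration of how such numbers can be obtained (a sketch written for this text, not the benchmark actually used in the talk), the 64-byte host-to-device latency can be estimated by timing many small copies; the same unmodified binary also runs on top of rCUDA, since only standard CUDA calls are involved:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t size = 64;                 // 64-byte payload, as in the table above
    const int reps = 1000;
    char host[64] = {0};
    char *dev = nullptr;
    cudaMalloc(&dev, size);

    cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice);      // warm-up copy
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice);  // pageable H2D copies
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
    printf("Average pageable H2D latency for 64 bytes: %.1f us\n", us);

    cudaFree(dev);
    return 0;
}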


HPC Advisory Council Spain Conference, Barcelona 2013 21/83

How GPU virtualization adds value

[Charts: bandwidth (MB/s) of a copy between GPU and CPU memories vs. transfer size (MB): Host-to-Device and Device-to-Host, with pageable and pinned memory]


HPC Advisory Council Spain Conference, Barcelona 2013 22/83

How GPU virtualization adds value

rCUDA has been successfully tested with the following applications:

NVIDIA CUDA SDK Samples
LAMMPS
WideLM
CUDASW++
OpenFOAM
HOOMDBlue
mCUDA-MEME
GPU-Blast
Gromacs


HPC Advisory Council Spain Conference, Barcelona 2013 23/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 24/83

A framework enabling a CUDA-based application running in one (or some) node(s) to access GPUs in other nodes.

It is useful for:

Moderate levels of data parallelism
Applications for multi-GPU computing

Basics of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 25/83

[Diagram: a CUDA application running on top of the CUDA libraries]

Basics of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 26/83

[Diagram: the CUDA application is split into a client side and a server side, with the CUDA libraries remaining on the server side]

Basics of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 27/83

[Diagram: rCUDA architecture. Client side: the application is linked against the rCUDA library, which forwards CUDA calls through the network device. Server side: the rCUDA daemon receives the calls and executes them through the regular CUDA libraries on the GPU]

Basics of the rCUDA framework



HPC Advisory Council Spain Conference, Barcelona 2013 30/83

rCUDA uses a proprietary communication protocol. Example:

1) initialization
2) memory allocation on the remote GPU
3) CPU-to-GPU memory transfer of the input data
4) kernel execution
5) GPU-to-CPU memory transfer of the results
6) GPU memory release
7) communication channel closing and server process finalization

(The matching client-side CUDA call sequence is sketched below.)

Basics of the rCUDA framework
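The minimal host program below is a sketch written for this text (the kernel name scale and its launch configuration are made up for the example); it issues exactly that sequence of CUDA calls, and under rCUDA every one of them is transparently forwarded to the remote server instead of a local GPU:

#include <cuda_runtime.h>

__global__ void scale(float *v, float f, int n)               // executed on the (remote) GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= f;
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n]();   // host input data (zero-initialized in this sketch)
    float *d = nullptr;

    cudaMalloc(&d, n * sizeof(float));                            // 1)+2) initialization (implicit on first call) and remote GPU allocation
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // 3) CPU-to-GPU transfer of the input data
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // 4) kernel execution
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // 5) GPU-to-CPU transfer of the results
    cudaFree(d);                                                  // 6) GPU memory release
    delete[] h;
    return 0;                                                     // 7) channel closing at process finalization
}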


HPC Advisory Council Spain Conference, Barcelona 2013 31/83

[Diagram: modular architecture. Client side: application, interception library, communications module (TCP or InfiniBand), network interface. Server side: network interface, communications module (TCP or InfiniBand), server daemon, CUDA library]

Basics of the rCUDA framework

rCUDA features a modular architecture


HPC Advisory Council Spain Conference, Barcelona 2013 32/83

Basics of the rCUDA framework

rCUDA features optimized communications:

Use of GPUDirect to avoid unnecessary memory copies
Pipeline to improve performance
Preallocated pinned memory buffers
Optimal pipeline block size
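The double-buffering idea behind such a pipeline can be illustrated with standard CUDA streams and preallocated pinned buffers. This is only a sketch of the general technique written for this text, not rCUDA's internal code: filling one staging buffer (which in rCUDA would correspond to receiving a block from the network) overlaps with the asynchronous copy of the other buffer to the GPU.

#include <cstring>
#include <cuda_runtime.h>

// Stream a large host buffer to the GPU through two preallocated pinned
// staging buffers, so that filling one buffer overlaps with the DMA copy
// of the other.
void pipelined_h2d(const char *src, char *dev_dst, size_t total, size_t block)
{
    char *pinned[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&pinned[i], block, cudaHostAllocDefault);  // preallocated pinned buffer
        cudaStreamCreate(&stream[i]);
    }

    int buf = 0;
    for (size_t off = 0; off < total; off += block, buf ^= 1) {
        size_t chunk = (total - off < block) ? (total - off) : block;
        cudaStreamSynchronize(stream[buf]);        // wait until this staging buffer is free again
        memcpy(pinned[buf], src + off, chunk);     // stands in for receiving one block from the network
        cudaMemcpyAsync(dev_dst + off, pinned[buf], chunk,
                        cudaMemcpyHostToDevice, stream[buf]);    // overlaps with filling the other buffer
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(pinned[i]);
    }
}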


HPC Advisory Council Spain Conference, Barcelona 2013 33/83

Pipeline block size for InfiniBand FDR

• NVIDIA Tesla K20; Mellanox ConnectX-3 + SX6025 Mellanox switch

Basics of the rCUDA framework

It was 2MB with IB QDR


HPC Advisory Council Spain Conference, Barcelona 2013 34/83

Basics of the rCUDA framework

[Charts: bandwidth of a copy between GPU and CPU memories, Host-to-Device with pageable and with pinned memory]


HPC Advisory Council Spain Conference, Barcelona 2013 35/83

Latency study: copy a small dataset (64 bytes)

Basics of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 36/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 37/83

Test system:

Intel Xeon E5-2620 (6 cores) 2.0 GHz (Sandy Bridge architecture)
NVIDIA Tesla K20 GPU
Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
SX6025 Mellanox switch
Cisco SLM2014 switch (1Gbps Ethernet)
CentOS 6.3 + Mellanox OFED 1.5.3

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 38/83

[Chart: normalized execution time of CUDA SDK examples for CUDA, rCUDA over FDR InfiniBand, and rCUDA over Gigabit Ethernet]

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 39/83

CUDASW++: bioinformatics software for Smith-Waterman protein database searches

[Chart: CUDASW++ execution time (s) and rCUDA overhead (%) vs. sequence length, for CUDA and for rCUDA over FDR, QDR, and GbE]

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 40/83

GPU-Blast: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool

[Chart: GPU-Blast execution time (s) and rCUDA overhead (%) vs. sequence length, for CUDA and for rCUDA over FDR, QDR, and GbE]

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 41/83

Test system for multi-GPU experiments:

For CUDA tests, one node with:
  2 Quad-Core Intel Xeon E5440 processors
  Tesla S2050 (4 Tesla GPUs)
  Each thread (1-4) uses one GPU

For rCUDA tests, 8 nodes with:
  2 Quad-Core Intel Xeon E5520 processors
  1 NVIDIA Tesla C2050
  InfiniBand QDR
  The test runs in one node and uses up to all the GPUs of the other nodes
  Each thread (1-8) uses one GPU

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 42/83

MonteCarlo MultiGPU (from NVIDIA SDK)

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 43/83

GEMM MultiGPU using libflame: high performance dense linear algebra library

http://www.cs.utexas.edu/~flame/web/

Performance of the rCUDA framework


HPC Advisory Council Spain Conference, Barcelona 2013 44/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 45/83

• SLURM (Simple Linux Utility for Resource Management) job scheduler
• SLURM does not know about virtualized GPUs
• A new GRES (generic resource) is added in order to manage virtualized GPUs
• Where the GPUs are located in the system is completely transparent to the user
• In the job script or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 46/83

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 47/83

ClusterName=rcu
…
GresTypes=gpu,rgpu
…
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4

slurm.conf

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 48/83

NodeName=rcu16 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16 OS=Linux
RealMemory=7682 Sockets=1 State=IDLE
ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime=2013-04-24T18:45:35
SlurmdStartTime=2013-04-30T10:02:04

scontrol show node

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 49/83

Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

gres.conf

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 50/83

Submit a job:

$ srun -N1 --gres=rgpu:4:512M script.sh

SLURM then sets, for example:

RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3

Integrating rCUDA with SLURM

Environment variables are initialized by SLURM and used by the rCUDA client (transparently to the user)

Each RCUDA_DEVICE_n value has the form: server name/IP address : GPU
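As a rough illustration of the mechanism (an assumption for this text, not rCUDA source code), the client side only needs to read those environment variables to learn which remote servers and GPUs were assigned to the job:

#include <cstdio>
#include <cstdlib>

// Sketch: discover the remote GPUs assigned by SLURM through the
// RCUDA_DEVICE_* environment variables shown above.
int main()
{
    const char *count_str = getenv("RCUDA_DEVICE_COUNT");
    int count = count_str ? atoi(count_str) : 0;

    for (int i = 0; i < count; ++i) {
        char var[32];
        snprintf(var, sizeof(var), "RCUDA_DEVICE_%d", i);
        const char *entry = getenv(var);            // format: server[:gpu_index]
        if (!entry)
            continue;
        char server[256];
        int gpu = 0;
        if (sscanf(entry, "%255[^:]:%d", server, &gpu) >= 1)
            printf("virtual device %d -> server %s, GPU %d\n", i, server, gpu);
    }
    return 0;
}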


HPC Advisory Council Spain Conference, Barcelona 2013 51/83

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 52/83

Integrating rCUDA with SLURM

GPUs are decoupled from nodes

All jobs are executed in less time


HPC Advisory Council Spain Conference, Barcelona 2013 53/83

Test bench for the SLURM experiments: an old "heterogeneous" cluster with 28 nodes:

1 node without GPU, with 2 AMD Opteron 1.4GHz, 6GB RAM
23 nodes without GPU, with 2 AMD Opteron 1.4GHz, 2GB RAM
1 node without GPU, QuadCore i7 3.5GHz, 8GB RAM
1 node with GeForce GTX 670 2GB, QuadCore i7 3GHz, 8GB RAM
1 node with GeForce GTX 670 2GB, Core2Duo 2.4GHz, 2GB RAM
1 node with 2 GeForce GTX 590 3GB, QuadCore i7 3.5GHz, 8GB RAM

1Gbps Ethernet
Scientific Linux release 6.1 and also OpenSuse 11.2 (dual boot)

Integrating rCUDA with SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 54/83

Integrating rCUDA with SLURM
matrixMul from NVIDIA's SDK

matrixA 320x320 matrixB 640x320

Total execution time including SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 55/83

Integrating rCUDA with SLURM
matrixMul from NVIDIA's SDK

The higher performance of rCUDA is due to a pre-initialized CUDA environment in the rCUDA server

Total execution time including SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 56/83

Integrating rCUDA with SLURM
matrixMul from NVIDIA's SDK

The time for initializing the CUDA environment is about 0.9 seconds

Total execution time including SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 57/83

Integrating rCUDA with SLURM
matrixMul from NVIDIA's SDK

Scheduling the use of remote GPUs takes less time than finding an available local GPU

Total execution time including SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 58/83

Integrating rCUDA with SLURM
Sharing remote GPUs among processes (matrixMul)

Total execution time including SLURM


HPC Advisory Council Spain Conference, Barcelona 2013 59/83

Integrating rCUDA with SLURM
Sharing remote GPUs among processes

sortingNetworks from NVIDIA’s SDK


HPC Advisory Council Spain Conference, Barcelona 2013 60/83

Integrating rCUDA with SLURM
Sharing remote GPUs among processes (CUDASW++)

The GPU memory gets full with 5 concurrent executions


HPC Advisory Council Spain Conference, Barcelona 2013 61/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors


HPC Advisory Council Spain Conference, Barcelona 2013 62/83

rCUDA and low-power processors

• There is a clear trend to improve the performance-power ratio in HPC facilities:
  • By leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, …)
  • More recently, by using low-power processors (Intel Atom, ARM, …)
• rCUDA allows attaching a "pool of accelerators" to a computing facility:
  • Fewer accelerators than nodes: accelerators are shared among nodes
  • The performance-power ratio is probably improved further
• How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs?
  • Which is the best configuration?
  • Is energy effectively saved?


HPC Advisory Council Spain Conference, Barcelona 2013 63/83

rCUDA and low-power processors

• The computing platforms analyzed in this study are:
  • KAYLA: NVIDIA Tegra 3 ARM Cortex A9 quad-core CPU (1.4 GHz), 2GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller
  • ATOM: Intel Atom quad-core CPU s1260 (2.0 GHz), 8GB DDR3 RAM and Intel I350 Gigabit Ethernet controller. No PCIe connector for a GPU
  • XEON: Intel Xeon X3440 quad-core CPU (2.4GHz), 8GB DDR3 RAM and 82574L Gigabit Ethernet controller
• The accelerators analyzed are:
  • CARMA: NVIDIA Quadro 1000M GPU (96 cores) with 2GB DDR5 RAM. The rest of the system is the same as for KAYLA
  • FERMI: NVIDIA GeForce GTX480 "Fermi" GPU (448 cores) with 1,280 MB DDR3/GDDR5 RAM


HPC Advisory Council Spain Conference, Barcelona 2013 64/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)


HPC Advisory Council Spain Conference, Barcelona 2013 65/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)
• The XEON+FERMI combination achieves the best performance


HPC Advisory Council Spain Conference, Barcelona 2013 66/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)
• LAMMPS makes a more intensive use of the GPU than CUDASW++


HPC Advisory Council Spain Conference, Barcelona 2013 67/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)
• The lower performance of the PCIe bus in KAYLA reduces performance


HPC Advisory Council Spain Conference, Barcelona 2013 68/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)
• The 96 cores of CARMA provide noticeably lower performance than a more powerful GPU


HPC Advisory Council Spain Conference, Barcelona 2013 69/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)
• CARMA requires less power


HPC Advisory Council Spain Conference, Barcelona 2013 70/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)
• When execution time and required power are combined into energy, XEON+FERMI is more efficient


HPC Advisory Council Spain Conference, Barcelona 2013 71/83

rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top), LAMMPS (bottom)

Summing up:
- If execution time is a concern, then use XEON+FERMI
- If power is to be minimized, CARMA is a good choice
- The lower performance of KAYLA's PCIe decreases the interest of this option for single-system, local-GPU usage
- If energy must be minimized, XEON+FERMI is the right choice


HPC Advisory Council Spain Conference, Barcelona 2013 72/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs
• The use of the Quadro GPU within the CARMA system as an rCUDA server is discarded due to its poor performance; only the FERMI GPU is considered
• Six different combinations of rCUDA clients and servers are feasible
• Only one client system and one server system are used in each experiment

Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             ATOM     KAYLA+FERMI
3             XEON     KAYLA+FERMI
4             KAYLA    XEON+FERMI
5             ATOM     XEON+FERMI
6             XEON     XEON+FERMI


HPC Advisory Council Spain Conference, Barcelona 2013 73/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; CUDASW++

[Charts: CUDASW++ results for each client (KAYLA, ATOM, XEON) combined with the KAYLA+FERMI and XEON+FERMI servers]


HPC Advisory Council Spain Conference, Barcelona 2013 74/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; CUDASW++
• The KAYLA-based server presents much higher execution time


HPC Advisory Council Spain Conference, Barcelona 2013 75/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; CUDASW++
• The network interface for KAYLA delivers much less than the expected 1Gbps bandwidth


HPC Advisory Council Spain Conference, Barcelona 2013 76/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; CUDASW++
• The lower bandwidth of KAYLA (and ATOM) can also be noticed when using the XEON server


HPC Advisory Council Spain Conference, Barcelona 2013 77/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; CUDASW++
• Most of the power and energy is required by the server side, as it hosts the GPU


HPC Advisory Council Spain Conference, Barcelona 2013 78/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; CUDASW++
• The much shorter execution time of XEON-based servers renders more energy-efficient systems


HPC Advisory Council Spain Conference, Barcelona 2013 79/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; LAMMPS

Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             KAYLA    XEON+FERMI
3             ATOM     XEON+FERMI
4             XEON     XEON+FERMI


HPC Advisory Council Spain Conference, Barcelona 2013 80/83

rCUDA and low-power processors
• 2nd analysis: use of remote GPUs; LAMMPS
• Similar conclusions to CUDASW++ apply to LAMMPS, due to the network bottleneck


HPC Advisory Council Spain Conference, Barcelona 2013 81/83

Outline

Current GPU computing facilities
How GPU virtualization adds value to your cluster
The rCUDA framework: basics and performance
Integrating rCUDA with SLURM
rCUDA and low-power processors
rCUDA work in progress


HPC Advisory Council Spain Conference, Barcelona 2013 82/83

rCUDA work in progress

Scheduling in SLURM taking into account the actual GPU utilization
Analysis of the benefits of using the SLURM+rCUDA suite in real production clusters with an InfiniBand fabric
Testing rCUDA with more applications from NVIDIA's list of "Popular GPU-accelerated Applications"
Support for CUDA 5.5
Integration of GPUDirect RDMA into rCUDA communications
Further analysis of hybrid ARM-XEON scenarios:
  Use of better network protocols for 1Gbps Ethernet
  Use of InfiniBand fabrics


http://www.rcuda.net (more than 350 requests)

rCUDA team:
Antonio Peña (1), Carlos Reaño, Federico Silla, José Duato, Adrian Castelló, Sergio Iserte, Vicente Roca, Rafael Mayo, Enrique S. Quintana-Ortí
(1) Now at Argonne National Lab. (USA)

Thanks to Mellanox for its support to this work