GPU accelerated Cray XC systems: Where HPC meets Big Data
Peter Messmer (NVIDIA), Chris Lindahl (Cray), Sadaf Alam (CSCS)
Cray User Group Meeting 2016, London, May 8 – 12, 2016
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
GPU accelerated Cray XC systems: Where HPC meets Big Data (Peter Messmer, NVIDIA)
"Yes," said Deep Thought, "I can do it."
[Seven and a half million years later.... ]
“The Answer to the Great Question... Of Life, the Universe and Everything...
Is... Forty-two,' said Deep Thought, with infinite majesty and calm.”
— Douglas Adams, Hitchhiker’s Guide to the Galaxy
HIGH PERFORMANCE COMPUTING TODAY*
*mostly
[Chart: applications plotted by accuracy ("sit in it", "has engine", "moves", "flies") vs. latency (month, week, day, hour, 5 min, 100 ms, 30 ms, 10 ms): HPC application, CG movie, flight simulator, action game, with parameter-space exploration and approximate models in between]
[Same accuracy-vs-latency chart, now marking an "Opportunity!" region: explorative science and real-time systems]
Bonsai With In-Situ Viz On Piz Daint
Presented at SC14, streaming live from CSCS/Switzerland to New Orleans
Compute & vis on 1024 GPU nodes; live streaming
J. Bedorf, E. Gaburov, P. Messmer, S. Portegies Zwart
Visualization-enabled supercomputers
CSCS Piz Daint: Galaxy formation
http://blogs.nvidia.com/blog/2014/11/19/gpu-in-situ-milky-way/
NCSA Blue Waters: Molecular dynamics
http://devblogs.nvidia.com/parallelforall/hpc-visualization-nvidia-tesla-gpus/
ORNL Titan: Cosmology
http://www.sdav-scidac.org/29-highlights/visualization/66-accelerated-cosmology-data-anal.html
Compute+Vis supports multiple workflows
LEGACY WORKFLOW: separate compute & vis systems; communication via the file system
CO-PROCESSING: compute and visualization on the same GPU; communication via host-device transfers or memcpy
PARTITIONED SYSTEM: different nodes for different roles; communication via the high-speed network
EGL Context Management
Today, top systems support OpenGL only under X
EGL: driver-based context management
Support for full OpenGL*, not only OpenGL ES
Available in e.g. VTK, ParaView, EnSight, VMD, ...
New opportunities for CUDA/OpenGL** interop
*Full OpenGL in r355.11; **CUDA interop in r358.7
[Diagram, "Leaving it to the driver": ParaView/VMD running on a Tesla driver with EGL on a Tesla GPU, with no X server in the stack]
Vis Tools Embrace OpenGL on EGL
Prior to EGL: an X server was required for GPU-accelerated OpenGL
Full OpenGL on EGL announced at SC15
With EGL: OpenGL without X; streamlined GPU-accelerated off-screen rendering
Major enabler for GPU rendering in HPC, incl. Cray systems*
Quick adoption by vis tool developers
https://devblogs.nvidia.com/parallelforall/egl-eye-opengl-visualization-without-x-server/
*Requires driver version 358.7 or newer
Modern OpenGL for HPC Viz
VTK now supports OpenGL 3.2
Enables advanced shaders (AO, VXGI, ...)
Some algorithms well suited for distributed-memory rendering
GPU hardware support mandatory to access advanced rendering features
Data courtesy Florida Intl University & TACC
Advanced Rendering in Scientific Visualization
• Ray tracing offers ambient occlusion lighting, shadows, high-quality transparent surfaces
• Better insight with visual cues
[Image panels: two lights, no shadows; two lights, hard shadows, 1 shadow ray per light; ambient occlusion + two lights, 144 AO rays/hit]
Courtesy of John Stone, UIUC
OpenGL Not Limited to Rendering Tasks
CUDA->OpenGL interop was typically one-way only; with EGL, interop goes both ways
EGL enables lighter-weight access to OpenGL: no X server needed
Potential use of OpenGL for rasterization-like problems?
Determine covered "pixels"
3D ordering/occlusion via the Z-buffer
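The Z-buffer idea above can be illustrated without OpenGL at all. A minimal sketch in plain Python: occlusion is resolved by keeping, per pixel, only the point with the smallest depth. The point format and grid size are illustrative assumptions, not part of any OpenGL API.

```python
def rasterize_points(points, width, height):
    """Resolve occlusion with a Z-buffer: for each covered pixel,
    keep only the point with the smallest depth (nearest to viewer)."""
    INF = float("inf")
    zbuf = [[INF] * width for _ in range(height)]   # nearest depth seen per pixel
    ids = [[None] * width for _ in range(height)]   # index of the winning point
    for pid, (x, y, z) in enumerate(points):
        if 0 <= x < width and 0 <= y < height and z < zbuf[y][x]:
            zbuf[y][x] = z
            ids[y][x] = pid
    return ids

# Two points land on the same pixel; the nearer one (smaller z) wins.
pts = [(1, 1, 5.0), (1, 1, 2.0), (0, 0, 1.0)]
visible = rasterize_points(pts, 2, 2)
print(visible[1][1])  # -> 1 (the point at depth 2.0 occludes depth 5.0)
```

On a GPU, the depth test runs in fixed-function hardware per fragment; the sketch only shows the ordering logic the slide alludes to.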
OpenGL Rendering Powerhouse: OpenGL vs. OpenSWR
[Chart: rendering performance, OpenGL vs. OpenSWR; bigger is better]
Accelerated Remote Rendering with Video Encoding
Lossy and lossless (Maxwell and later) H.264 encoder
Separate unit, does not consume "GPU resources"
Leveraged by commercial and free tools
Available on e.g. Titan: NICE DCV running on Titan in user space
Possible use for non-video data
Interactivity over large distances
https://developer.nvidia.com/nvidia-video-codec-sdk
INTRODUCING TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node
Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
CoWoS HBM2: Unifying Compute & Memory in a Single Package
Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory (Unified Memory)
[Diagram: compute node with two CPUs, PCIe switches, and Tesla P100 GPUs sharing Unified Memory]
Giant leaps in everything
PASCAL ARCHITECTURE: 21 teraflops of FP16 for deep learning
NVLINK: 5x GPU-GPU bandwidth
CoWoS HBM2 stacked memory: 3x higher bandwidth for massive data workloads
PAGE MIGRATION ENGINE: virtually unlimited memory space
[Charts: K40 / M40 / P100 compared on teraflops (FP32/FP16), bidirectional NVLink bandwidth (GB/s), memory bandwidth (GB/s), and addressable memory (GB)]
nvGRAPH: Accelerated Graph Analytics
nvGRAPH for high-performance graph analytics
Delivers results up to 3x faster than CPU-only
Solves graphs with up to 2.5 billion edges on a single M40
Accelerates a wide range of graph analytics apps:
PageRank: search, recommendation engines, social ad placement
Single Source Shortest Path: robotic path planning, power network planning, logistics & supply chain planning
Single Source Widest Path: IP routing, chip design / EDA, traffic-sensitive routing
developer.nvidia.com/nvgraph
[Chart: nvGRAPH 3x speedup in iterations/s, PageRank on a Twitter 1.5B-edge dataset; nvGRAPH on M40 vs. 48-core Xeon E5. CPU system: 4U server with 4x 12-core Xeon E5-2697, 30 MB cache, 2.70 GHz, 512 GB RAM]
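The PageRank benchmark above can be sketched in plain Python. This is a generic power-iteration PageRank, not the nvGRAPH API; the four-node graph, damping factor, and tolerance are illustrative assumptions.

```python
def pagerank(edges, n, damping=0.85, tol=1e-9, max_iter=100):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Base term plus redistributed rank of dangling (no-outlink) nodes.
        dangling = damping * sum(r for r, d in zip(rank, out_deg) if d == 0) / n
        new = [(1.0 - damping) / n + dangling] * n
        for s, d in edges:
            new[d] += damping * rank[s] / out_deg[s]
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            rank = new
            break
        rank = new
    return rank

# Tiny 4-node graph: node 0 is linked to by everyone.
edges = [(1, 0), (2, 0), (3, 0), (0, 1)]
ranks = pagerank(edges, 4)
print(max(range(4), key=ranks.__getitem__))  # node 0 has the highest rank
```

nvGRAPH runs the same fixed-point iteration as sparse matrix-vector products on the GPU, which is where the 3x speedup on the 1.5B-edge graph comes from.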
GPUs in XC = GPUs elsewhere (almost)
GPU Accelerated Cray XC Systems: Where HPC Meets Big Data (CUG 2016 BOF, Sadaf Alam, CSCS, May 10, 2016)
Platform Consolidation: Enabling Workloads & Workflows Diversity
§ Simulations using a range of programming paradigms
  § MPI, OpenMP, ...
  § CUDA, OpenCL, OpenACC, ...
  § Optimized libraries
  § Domain-specific libraries
  § ...
§ In-situ and large-scale visualization tools
  § VisIt
  § ParaView
  § Application specific
§ New domains (e.g. DL)
Continuous Co-design & Integration
§ GPUs in Cray XK6/7 and XC environments have come a long way
  § GPUDirect and CUDA-aware MPI
  § Quick CUDA releases and updates
  § Availability of a complete toolchain & "user controlled" compute modes and features
§ Potential with containers
  § Variety of workloads and workflows with other dependencies (e.g. OS, Python, etc.)
  § Data science applications & workloads
  § DL solutions and frameworks
§ Emerging technologies from Nvidia readily available on Cray XC
Further Collaboration with Sites
§ Strengthen partnership with Cray and Nvidia ... and other CUG sites
§ Enable richer experiences for users
§ Take advantage of
  § Excellent interconnect with GPU-aware MPI
  § Flexible programming and execution models
§ Leverage & develop open-source solutions
  § Nvidia DL SDK (accelerated frameworks)