
1

How many GPUs do we really need ?

Arnaud Sevin, Damien Gratadour & Julien Brule, LESIA, Observatoire de Paris

2

Outline

๏ Possible GPU-based architectures

๏ GPU environment

๏ Measuring latency and jitter

๏ Problem scale

๏ Pure compute performance

๏ Jitter

๏ Future works

3

GPU-based architectures

๏ GPUs : accelerators / coprocessors = a device in a host (CPU)

๏ Connection via PCIe lanes (x16 Gen3 : 128 Gb/s)

๏ Several setups for full bandwidth

๏ Desktop : single socket (40 PCIe lanes) + 1 or 2 GPUs

๏ Workstation : dual socket + 2 or 4 GPUs

๏ Cluster : proprietary dual-socket nodes (2-3 GPUs) + InfiniBand interconnect (40 Gb/s)

๏ Other setups are throughput-oriented

๏ PCIe switches behind the IOH : up to 8 GPUs on a single IOH (shared bandwidth)

๏ Through the IOH : 40 Gb/s in & 32 Gb/s out (measured on Fermi with Gen2)

๏ New Intel X79 chipset (no IOH) : up to 48 Gb/s in & out (measured on Kepler with Gen3; a measurement sketch follows this list)
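As a rough illustration of how such in/out figures can be measured (a minimal sketch, not the benchmark used here; buffer size and iteration count are arbitrary), a pinned-memory host-to-device transfer timed with CUDA events:

// Minimal sketch: measure host-to-device PCIe bandwidth with pinned memory
// and CUDA events. The 256 MB buffer and 10 iterations are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t nBytes = 256UL << 20;          // 256 MB test buffer
    void *hBuf, *dBuf;
    cudaMallocHost(&hBuf, nBytes);              // pinned host memory for async DMA
    cudaMalloc(&dBuf, nBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)                // average over a few transfers
        cudaMemcpyAsync(dBuf, hBuf, nBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (10.0 * nBytes * 8.0) / (ms * 1e-3) / 1e9;   // Gb/s
    printf("H2D: %.1f Gb/s\n", gbps);

    cudaFreeHost(hBuf);
    cudaFree(dBuf);
    return 0;
}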

4

GPU-based architectures

๏ NVIDIA, leader in GPGPU : CUDA architecture & toolkit

๏ Tesla Fermi : 16 Streaming Multiprocessors x 32 cores = 512 CUDA cores

๏ Tesla Kepler K10 : 8 SMs x 192 cores = 1536 CUDA cores x 2 GPU chips (+ on-board PCIe switch)

๏ Up to 6 GB of on-board memory

๏ Stream processing

๏ Several kernel launches can run concurrently

๏ 2 copy engines : bi-directional, asynchronous copies + compute overlap (hides host-to-device / device-to-host memcopy latency and removes the device-memory limitation)

๏ 1 compute-engine queue + concurrent kernels : increases performance, maximizes GPU utilization for small kernels (Fermi : up to 16 concurrent kernels); see the stream sketch below

๏ Get as close as possible to the theoretical peak throughput : ~1 TFLOPS
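A minimal sketch of the copy/compute overlap described above (the standard CUDA streams pattern, not the RTC code itself); processChunk is a placeholder kernel, and the host buffer is assumed to be pinned:

// Sketch of copy/compute overlap with CUDA streams: while the copy engine
// moves chunk s, the compute engine can process an earlier chunk.
#include <cuda_runtime.h>

__global__ void processChunk(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.f * in[i];            // dummy per-chunk compute
}

void runPipeline(const float *hIn,              // pinned host buffer (cudaMallocHost)
                 float *dIn, float *dOut, int nTotal, int nStreams) {
    cudaStream_t streams[16];                   // assumes nStreams <= 16
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = nTotal / nStreams;              // assume it divides evenly
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(dIn + off, hIn + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            dIn + off, dOut + off, chunk);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
}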

5

GPU environment

๏ Choice of OS/kernel

๏ SL 6.3 64-bit (2.6.32) / NVIDIA driver 304.54 (CUDA 5 drivers)

๏ Based on SL 6.3 (3.2.23-rt37) / NVIDIA driver 304.54, RT-patched with Red Hat Enterprise MRG tools

๏ Multi-GPU : peer-to-peer

๏ Workstation : CUDA provides peer-to-peer communication between GPUs on the same motherboard (a minimal example follows this list)

๏ Single IOH / single node : GPUDirect

๏ Limited impact on dual-IOH setups (30% in bandwidth)

๏ Multi(>2)-GPU : MPI / OpenMP + CUDA

๏ MPI support for Unified Virtual Addressing (peer-to-peer)

๏ GPUDirect over InfiniBand (shared pinned memory on the host)
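A minimal peer-to-peer example using the standard CUDA runtime calls (cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess / cudaMemcpyPeer); the buffer size is arbitrary and this only illustrates the mechanism, not the RTC transport code:

// Sketch: enable peer-to-peer access between GPU 0 and GPU 1 and copy a
// buffer directly between them (no staging through host memory).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("P2P not available on this topology\n");
        return 1;
    }

    const size_t nBytes = 1 << 20;              // 1 MB, illustrative
    float *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, nBytes);
    cudaDeviceEnablePeerAccess(1, 0);           // flags must be 0
    cudaSetDevice(1); cudaMalloc(&d1, nBytes);
    cudaDeviceEnablePeerAccess(0, 0);

    // Direct device-to-device copy over PCIe (through the IOH or switch)
    cudaMemcpyPeer(d1, 1, d0, 0, nBytes);
    cudaDeviceSynchronize();

    cudaFree(d1);
    cudaSetDevice(0); cudaFree(d0);
    return 0;
}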

6

Measuring latency & jitter

๏ NVIDIA profiling tools

๏ NVVP (graphical) / nvprof (command line) : profiling at the GPU (device) level

๏ Better understand CPU-GPU interactions, identify bottlenecks and concurrency opportunities

๏ Monitor multiprocessor occupancy, optimize code

๏ TAU / PAPI / CUPTI

๏ TAU : generic tool for heterogeneous systems, profiling at the host level (kernel-level profiling available); a host-side timing sketch follows this list

๏ PAPI CUDA : provides access to the hardware counters of NVIDIA GPUs. Based on NVIDIA CUPTI, the CUDA Performance Tools Interface. Provides device-level access

๏ Intrinsically intrusive, but with limited impact
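For host-level latency and jitter figures, a simple approach (a sketch, not the TAU/PAPI instrumentation itself) is to timestamp each frame around a synchronized GPU call; rtcFrame() below is a placeholder for the real pixel-copy + centroiding + MVM sequence:

// Host-side latency/jitter measurement sketch: timestamp each frame and
// report mean, RMS and min/max over a long run.
#include <cstdio>
#include <cmath>
#include <time.h>
#include <cuda_runtime.h>

static double nowMs() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

void rtcFrame() { /* placeholder: launch the coreRTC kernels here */ }

int main() {
    const int nFrames = 10000;                  // illustrative sample size
    double sum = 0, sumSq = 0, tMin = 1e9, tMax = 0;

    for (int i = 0; i < nFrames; ++i) {
        double t0 = nowMs();
        rtcFrame();
        cudaDeviceSynchronize();                // wait for the GPU work to finish
        double dt = nowMs() - t0;
        sum += dt; sumSq += dt * dt;
        if (dt < tMin) tMin = dt;
        if (dt > tMax) tMax = dt;
    }
    double mean = sum / nFrames;
    double rms  = sqrt(sumSq / nFrames - mean * mean);
    printf("mean %.3f ms, rms jitter %.3f ms, min %.3f ms, max %.3f ms\n",
           mean, rms, tMin, tMax);
    return 0;
}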

7

Problem scale

๏ E-ELT : 40 m telescope

๏ SCAO system : 80x80 subaps with 6x6 pixels

๏ ~0.25 Mpix per transfer (1 Mb) => ~1 Gb/s (1 kHz)

๏ 10k slopes (~0.2 Mb) and 5k DM commands

๏ 10k x 5k command matrix : ~1.5 Gb

๏ MVM : 163 MFLOP => 163 GFLOPS (1 kHz); the count is worked out below

๏ Multiple WFS / multiple DMs / LGS

๏ MAORY : up to 6 LGS WFS with 80x80 subaps, 12x12 pixels for LGS, and 3 DMs

๏ Bandwidth requirement : ~15 Gb/s (1 kHz)

๏ Command matrix up to ~20 Gb

๏ Throughput requirement (1 kHz) for MVM : 1.5 TFLOPS
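A quick check of the MVM figure, assuming the full 80x80 square-pupil counts (2 x 80^2 = 12,800 slopes, 80^2 = 6,400 commands) rather than the rounded 10k / 5k values quoted above:

\[
  W_{\mathrm{MVM}} = 2\, n_{\mathrm{slopes}}\, n_{\mathrm{cmd}}
                   = 2 \times 12\,800 \times 6\,400 \approx 164\ \mathrm{MFLOP\ per\ frame}
  \quad\Rightarrow\quad
  164\ \mathrm{MFLOP} \times 1\ \mathrm{kHz} \approx 164\ \mathrm{GFLOPS}.
\]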

8

Benchmarking

๏ NVIDIA Visual Profiler : profiling overhead

๏ Simulating the coreRTC process (a code sketch follows the diagram below)

[Diagram : coreRTC data path — pixel data (CPU) → centroiding (subap Nphot, X and Y centroids) and MVM with the command matrix (cmat) on the GPU → DM commands (CPU)]
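A possible shape for the simulated coreRTC frame shown in the diagram (a sketch only : the centroid kernel is a bare centre-of-gravity without reference-slope subtraction, buffer names are illustrative, the host buffers are assumed pinned, and the command matrix is assumed resident on the GPU; the MVM uses cuBLAS SGEMV):

// One coreRTC frame: pixel data in, centroiding, MVM against cmat, DM
// commands out, all enqueued on one CUDA stream.
#include <cublas_v2.h>
#include <cuda_runtime.h>

__global__ void centroid(const float *pixels, float *slopes,
                         int nSubaps, int npix) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSubaps) return;
    float sx = 0.f, sy = 0.f, flux = 1e-6f;
    for (int j = 0; j < npix; ++j)
        for (int i = 0; i < npix; ++i) {
            float v = pixels[(s * npix + j) * npix + i];
            sx += v * i; sy += v * j; flux += v;
        }
    slopes[2 * s]     = sx / flux;              // X centroid
    slopes[2 * s + 1] = sy / flux;              // Y centroid
}

void coreRtcFrame(cublasHandle_t h, const float *hPixels, float *dPixels,
                  const float *dCmat, float *dSlopes, float *dCmd,
                  float *hCmd, int nSubaps, int npix, int nCmd,
                  cudaStream_t stream) {
    int nSlopes = 2 * nSubaps;
    size_t pixBytes = (size_t)nSubaps * npix * npix * sizeof(float);
    const float one = 1.f, zero = 0.f;

    cublasSetStream(h, stream);
    cudaMemcpyAsync(dPixels, hPixels, pixBytes,
                    cudaMemcpyHostToDevice, stream);          // pixel data in
    centroid<<<(nSubaps + 127) / 128, 128, 0, stream>>>(
        dPixels, dSlopes, nSubaps, npix);                     // X/Y centroids
    // DM commands = cmat (nCmd x nSlopes, column-major) * slopes
    cublasSgemv(h, CUBLAS_OP_N, nCmd, nSlopes, &one,
                dCmat, nCmd, dSlopes, 1, &zero, dCmd, 1);
    cudaMemcpyAsync(hCmd, dCmd, nCmd * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);          // commands out
}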

9

Pure compute performance

๏ 80x80 (square pupil, no obstruction => upper limit), 6x6 pixels

๏ 1 GPU, 4 streams

[Profiler timeline, 10 ms window]

10

Pure compute performance

๏ 80x80 (square pupil, no obstruction => upper limit), 6x6 pixels

๏ 1 GPU, 4 streams

๏ Good level of concurrency (memcopy latency hiding + increased performance)

๏ Execution time dominated by the MVM

๏ Centroiding has high multiprocessor occupancy

๏ Throughput limited (the pixel-data copy takes about 500 µs)

11

Pure compute performance

๏ 80x80 (square pupil, no obstruction => upper limit), 6x6 pixels

๏ 1 GPU, 8 streams

๏ Slight performance gain (< 2.602 ms)

๏ Lower kernel-concurrency level (load too small)

๏ Explicit synchronization may be required (see the event-based sketch below)

๏ The pixel copy takes longer (smaller chunks, sub-optimal) : > 1 ms
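One way to express that explicit synchronization (a sketch, not the actual implementation) : record an event at the end of each per-stream centroiding chunk and make the MVM stream wait on all of them with cudaStreamWaitEvent; the kernels below are placeholders:

// Cross-stream ordering with events: the MVM stream starts only after every
// per-stream centroiding chunk has finished.
#include <cuda_runtime.h>

__global__ void centroidChunk(float *slopes, int chunkId) {
    /* placeholder for the per-chunk centroiding work */
}

__global__ void mvmKernel(const float *slopes, float *cmd) {
    /* placeholder for the MVM (in practice a cuBLAS call) */
}

void frameWithEvents(cudaStream_t *chunkStreams, int nStreams,
                     cudaStream_t mvmStream, float *dSlopes, float *dCmd) {
    cudaEvent_t done[16];                            // assumes nStreams <= 16
    for (int s = 0; s < nStreams; ++s)
        cudaEventCreateWithFlags(&done[s], cudaEventDisableTiming);

    for (int s = 0; s < nStreams; ++s) {
        centroidChunk<<<64, 128, 0, chunkStreams[s]>>>(dSlopes, s);
        cudaEventRecord(done[s], chunkStreams[s]);   // mark chunk s as done
    }
    for (int s = 0; s < nStreams; ++s)
        cudaStreamWaitEvent(mvmStream, done[s], 0);  // MVM waits on all chunks
    mvmKernel<<<64, 128, 0, mvmStream>>>(dSlopes, dCmd);

    cudaStreamSynchronize(mvmStream);
    for (int s = 0; s < nStreams; ++s) cudaEventDestroy(done[s]);
}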

12

Pure compute performance

๏ 80x80 (square pupil, no obstruction => upper limit), 6x6 pixels

๏ 1 GPU, 16 streams

๏ Same performance

๏ The pixel copy is dispatched over the whole process

13

Pure compute performance

๏ 80x80 (square pupil, no obstruction => upper limit), 6x6 pixels

๏ 2 GPUs, 4 streams (a two-GPU MVM split is sketched below)

๏ Increased throughput

๏ The pixel copy takes ~1 ms

๏ Best trade-off

๏ > 500 Hz with profiling overheads
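A sketch of one way to split the work over two GPUs (not necessarily the decomposition used here) : each device holds half of the command-matrix rows and produces half of the DM command vector; the slope vector is assumed already copied to both devices, and hCmd is assumed pinned:

// Row-wise split of the MVM over two GPUs using cuBLAS SGEMV.
#include <cublas_v2.h>
#include <cuda_runtime.h>

struct GpuHalf {
    cublasHandle_t handle;     // created on this device
    cudaStream_t   stream;     // created on this device
    float *dCmatHalf;          // (nCmd/2) x nSlopes block, column-major
    float *dSlopes;            // local copy of the slope vector
    float *dCmdHalf;           // this GPU's half of the command vector
};

void mvmTwoGpus(GpuHalf g[2], float *hCmd, int nSlopes, int nCmd) {
    const float one = 1.f, zero = 0.f;
    int half = nCmd / 2;                        // assume nCmd is even

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cublasSetStream(g[dev].handle, g[dev].stream);
        cublasSgemv(g[dev].handle, CUBLAS_OP_N, half, nSlopes, &one,
                    g[dev].dCmatHalf, half, g[dev].dSlopes, 1,
                    &zero, g[dev].dCmdHalf, 1);
        cudaMemcpyAsync(hCmd + dev * half, g[dev].dCmdHalf,
                        half * sizeof(float), cudaMemcpyDeviceToHost,
                        g[dev].stream);
    }
    for (int dev = 0; dev < 2; ++dev) {         // wait for both halves
        cudaSetDevice(dev);
        cudaStreamSynchronize(g[dev].stream);
    }
}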

14

Pure compute performance

๏ 80x80 (square pupil, no obstruction => upper limit), 6x6 pixels

๏ 2 GPUs, 8 streams

๏ Lower performance

๏ Bad concurrency

๏ Chunks too small

๏ Sub-optimal (low occupancy)

15

Jitter

๏ Need for RT kernels

๏ SL 6.3 without / with the RT kernel, patched NVIDIA drivers, CPU shielding and affinity on HP SL390s motherboards (no RT scheduling); an affinity sketch follows this slide

๏ Dramatic reduction of jitter, slight throughput gain

๏ Throughput / jitter ensure real-time operation at > 500 Hz for SCAO with 2 GPUs (square pupil, no obstruction)

[Jitter histograms : SL 6.3, non-real-time kernel vs. SL 6.3, real-time kernel]
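The affinity part of that setup can be done from the application itself; a minimal sketch (the core index is illustrative, and the shielding itself, e.g. isolcpus or cset, is configured at the OS level and not shown here):

// Pin the RTC host thread to a shielded core before starting the GPU pipeline.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                 // run only on this (isolated) core
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main() {
    if (pinToCore(3) == 0)               // e.g. a core shielded from the OS
        printf("RTC thread pinned to core 3\n");
    /* ... start the GPU pipeline here ... */
    return 0;
}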

16

Jitter

๏ Impact of the new Intel X79 chipset (PCIe lanes on the socket, no IOH)

๏ Not bandwidth-limited, so limited gain in performance

๏ Large impact on the single-GPU case (why ?)

๏ Lower jitter in the dual-GPU case

๏ Are CUDA 5.0 & the NVIDIA driver fully compatible with this new chipset ?

[Jitter plots : with IOH (old generation) vs. without IOH (new generation)]

17

Jitter

๏ Using GeForce cards (gamer-oriented), to reduce cost

๏ 1 Msample : about 1/2 h of operation at 600 Hz

๏ SL 6.3 with the RT kernel & options

๏ 2 GPUs (GeForce GTX 690, on-board PCIe switch)


19

Jitter

๏ Using GeForce cards (gamer-oriented), to reduce cost

๏ Large jitter at all scales (several ms)

๏ Pure throughput is OK, but the jitter is incompatible with RT operations

๏ GeForce is fine for simulations but not compatible with RT

20

Jitter

๏ Using GeForce cards (gamer-oriented), to reduce cost

๏ Using only one GPU gives lower throughput

๏ 2 GPUs : the increased jitter on GeForce cancels out the gain in throughput (not compatible with RT, so no point in using 2 GPUs)

๏ Part of the jitter (single GPU) may be due to hardware (X79 chipset)

21

The MAORY case

๏ MCAO with 6 LGS WFS 80x80 (baseline = 12x12 pixels) and 3 DMs

๏ 6 dual-GPU nodes : 1 per WFS (the command vector is integrated outside the nodes; see the MPI sketch below)

๏ Close to 200 Hz
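A sketch of the one-node-per-WFS layout with MPI (function names and the command-vector size are illustrative, not the actual implementation) : each rank computes its WFS's partial command vector on its local GPUs, and the partials are summed outside the nodes with an MPI reduction:

// One MPI rank per WFS node: local GPU pipeline, then a global reduction
// of the partial DM command vectors.
#include <mpi.h>
#include <vector>

void computePartialCmd(int wfsId, float *partial, int nCmd) {
    /* placeholder: run this WFS's pixel-copy + centroiding + MVM on the
       node's two GPUs and copy the partial command vector back */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;                                   // one rank = one WFS node
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nCmd = 15000;                     // illustrative total over 3 DMs
    std::vector<float> partial(nCmd), total(nCmd);

    // One RTC frame: each node produces its partial command vector, then the
    // partials are summed ("integrated") across the 6 WFS nodes.
    computePartialCmd(rank, partial.data(), nCmd);
    MPI_Reduce(partial.data(), total.data(), nCmd, MPI_FLOAT,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        /* placeholder: forward 'total' to the DM controllers */
    }

    MPI_Finalize();
    return 0;
}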

22

Conclusions

๏ The throughput is there

๏ RT operations for SCAO 80x80 at > 500 Hz (2 GPUs)

๏ The CUDA framework allows optimizations (N streams) and memcopy latency hiding

๏ Means to reduce jitter

๏ Use of an RT kernel with the NVIDIA RT patch

๏ Use of professional Tesla boards (M2090, K10, K20, K20X, …)

๏ Long-run performance (including jitter) compatible with RT operations on Tesla boards

๏ GeForce boards have similar throughput, but...

๏ Huge jitter (> 2 ms) → performance is not stable

๏ OK for simulations but incompatible with RT

23

Future works

๏ Use host-level (kernel-level) profiling :

๏ Absolute performance and jitter measurements

๏ Kernel module for data exchange

๏ The first step is to feed the GPU from a custom kernel module with GPUDirect compatibility

๏ Next, the GPU can feed a second kernel module (also GPUDirect-compatible)

๏ Accurate measurement of the real jitter

๏ Acquisition interface through a serial protocol

๏ Studies on MPI/RT