E8270 – INSIDE NVIDIA GPU CLOUD CONTAINERS (on-demand.gputechconf.com/gtc-eu/2018/pdf/e8270...)
Chris Kawalek, NVIDIA GPU Cloud Product Team, NVIDIA
Michael O’Connor, Optimized Deep Learning Frameworks, NVIDIA
E8270 – INSIDE NVIDIA GPU CLOUD CONTAINERS
2
AGENDA
The Difficulty With Complex Software
Running In Different Environments
Why Containers
Diving Into NGC Deep Learning Containers
Q&A
3
CHALLENGES WITH COMPLEX SOFTWARE
Current DIY GPU-accelerated AI and HPC deployments are complex and time-consuming to build, test, and maintain
Development of software frameworks by the community is moving very fast
Requires a high level of expertise to manage driver, library, and framework dependencies
Software stack (diagram):
Applications or Frameworks
NVIDIA Libraries
NVIDIA Container Runtime for Docker
NVIDIA Driver
NVIDIA GPU
4
NVIDIA GPU CLOUD (NGC)
Over 35 GPU-Accelerated Containers: Deep learning, HPC applications, HPC visualization tools, and partner applications
Innovate in Minutes, Not Weeks: Pre-configured, ready-to-run
Run Anywhere: NVIDIA GPUs on the top cloud providers, NVIDIA DGX Systems, and PCs and workstations
Simple Access to GPU-Accelerated Software
5
A CONSISTENT, HYBRID CLOUD EXPERIENCE ACROSS COMPUTE PLATFORMS
7
WORK AT SCALE ON AI SUPERCOMPUTERS
NGC Containers Run on NVIDIA DGX Systems
8
DEVELOP ON NVIDIA TITAN & NVIDIA QUADRO
NGC Containers Run on PCs and Workstations with Select NVIDIA GPUs
9
WHY CONTAINERS?
Benefits of Containers:
Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
Isolate individual deep learning frameworks and HPC applications
Share, collaborate, and test applications across different environments
10
CONTINUAL EXPANSION
October 2017: 10 containers
October 2018: 36 containers
Deep Learning: caffe, caffe2, cntk, cuda, digits, mxnet, pytorch, tensorflow, tensorrt, tensorrtserver, theano, torch
HPC: bigdft, candle, chroma, gamess, gromacs, lammps, lattice-microbes, MILC, namd, pgi, picongpu, relion, vmd
HPC Visualization: index, paraview-holodeck, paraview-index, paraview-optix
Partners: chainer, h2oai-driverless, kinetica, mapd, matlab, paddlepaddle
NVIDIA/K8s: Kubernetes on NVIDIA GPUs
11
USING NGC CONTAINERS
Data Scientists and Researchers: Eliminate setup time, focus on science and research
Developers: Work with the latest software with a known good starting point
Sysadmins: Deploy to production immediately
12
VIRTUAL MACHINES VS. CONTAINERS
Packaging and deployment mechanism for applications
▶ Consistent and reproducible deployment
▶ Lightweight and lower overhead than VMs
▶ Logical isolation from other applications
Motivation
13
EXAMPLE NGC CONTAINER WORKFLOW
NVIDIA builds application image composed of layers of files
Image(s) tested and released to NGC repository hosted at URLs like nvcr.io/nvidia/tensorflow
User pulls image to a machine and runs it
Image is cached, and an OS-isolated set of resources (a container) is allocated in which to execute it
Data & results accessed as a filesystem volume
$ docker run nvcr.io/…
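The workflow above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not NVIDIA tooling: the helper names `ngc_image` and `ngc_run`, the `DRY_RUN` switch, and the `18.09` tag in the usage line are illustrative; only the `nvcr.io/nvidia/tensorflow` repository path comes from the slide, and `docker pull` / `docker run` are standard Docker commands.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the pull-and-run workflow.

# Build an image reference such as nvcr.io/nvidia/tensorflow:18.09.
ngc_image() {
    printf 'nvcr.io/nvidia/%s:%s\n' "$1" "$2"
}

# Pull the image (it is cached locally after the first pull), then run it.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
ngc_run() {
    image=$(ngc_image "$1" "$2")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "docker pull $image"
        echo "docker run --rm -it $image"
    else
        docker pull "$image" && docker run --rm -it "$image"
    fi
}

ngc_run tensorflow 18.09
```

Setting `DRY_RUN=0` would execute the commands instead of printing them, which requires Docker and access to the registry.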
14
ANATOMY OF AN NGC CONTAINER IMAGE
ubuntu:16.04
Image Layers (R/O)
f2233041f557
145c1bf7947a
0c395732af81
fb91e851e672
R/W Layer
NVIDIA Deep Learning SDK
NVIDIA CUDA SDK
DL Framework & Source
Examples & Scripts
…
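The split between read-only image layers and a per-container R/W layer can be illustrated with a toy model. This is only a sketch of the lookup idea using plain directories, not how Docker's storage drivers are implemented; the layer names (`base`, `cuda`, `framework`, `rw`), file names, and the `layer_lookup` helper are all hypothetical.

```shell
#!/bin/sh
# Toy model of layered lookup (NOT Docker's actual storage driver).
root=$(mktemp -d)
mkdir -p "$root/base" "$root/cuda" "$root/framework" "$root/rw"

# Read-only image layers, lowest first.
echo "ubuntu 16.04" > "$root/base/os-release"
echo "cuda toolkit" > "$root/cuda/toolkit.txt"
echo "framework v1" > "$root/framework/version.txt"

# Resolve a file: the R/W layer wins, then R/O layers from top to bottom.
layer_lookup() {
    for layer in rw framework cuda base; do
        if [ -f "$root/$layer/$1" ]; then
            cat "$root/$layer/$1"
            return 0
        fi
    done
    return 1
}

layer_lookup version.txt                                 # served from the read-only framework layer
echo "framework v1 (patched)" > "$root/rw/version.txt"   # a write lands in the R/W layer
layer_lookup version.txt                                 # now served from the R/W layer
cat "$root/framework/version.txt"                        # the image layer itself is unchanged
```

The point of the model is that changes made inside a running container never alter the shared image layers, so many containers can start from the same cached image.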
15
ALWAYS UP-TO-DATE
Monthly Releases from NVIDIA
Component | 18.09 | 18.08
Supported Platform: DGX OS | 4.0.1 and 3.1.2+ | 3.1.2+ and 2.1.1+
NVIDIA Driver | 410 and 384 | 384
Base Image: Ubuntu | 16.04 | 16.04
CUDA | 10.0.130 | 9.0.176
cuBLAS | 10.0.130 | 9.0.425 (aka Patch 4)
cuDNN | 7.3.0 | 7.2.1
NCCL | 2.3.4 | 2.2.1
NVIDIA Optimized Frameworks:
NVCaffe | 0.17.1 for Python 3.5 | 0.17.1 for Python 2.7
DIGITS | 6.1.1 | 6.1.1
MXNet | 1.3 for Python 3.5 | 1.2.0+ for Python 2.7 and Python 3.5
PyTorch | 0.4.1++ for Python 3.6 | 0.4.1+ for Python 3.6
TensorFlow | 1.10 for Python 2.7 and Python 3.5 (TensorRT 5.0.0) | 1.9.0 for Python 2.7 and Python 3.5 (TensorRT 4.0.1)
TensorRT | 5.0.0 | 4.0.1
TensorRT Server | 0.6 | 0.5
TensorFlow for Jetson | 1.10 on JetPack 4.1 for Xavier | 1.9.0 on JetPack 3.2 for TX2
16
CUDA COMPATIBILITY – UPGRADE PATHS
NEW Forward Compatibility Option
Upgrade only user-mode CUDA components*
Diagram: CUDA 9.0 on the R384 driver and CUDA 10.0 on the R410 driver. Each stack consists of CUDA Toolkit and Runtime, CUDA User Mode Driver (libcuda.so), and GPU Kernel Mode Driver (nvidia.ko), with "Upgrade" arrows between the two stacks.
New compatibility platform upgrade path available (starting with CUDA 10.0)
▶ Use newer CUDA toolkits on older driver installs
▶ Compatibility only with specific older driver versions
System requirements
▶ Tesla GPU support only – no Quadro or GeForce
▶ Only available on Linux
*requires new ‘cuda-compat-10-0’ package
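The two upgrade paths can be expressed as a small decision helper. This is a minimal sketch that encodes only what the slide states for the R384 and R410 driver branches; the function names `cuda10_upgrade_path` and `driver_branch` are hypothetical, the version strings in the usage lines are example inputs, and the slide's caveat that only specific older driver versions are compatible is not modeled.

```shell
#!/bin/sh
# Hypothetical helper: classify how CUDA 10.0 can be used on a given driver
# branch, using only the two branches named on the slide (R384, R410).
cuda10_upgrade_path() {
    case "$1" in
        410) echo "native: R410 driver runs CUDA 10.0 directly" ;;
        384) echo "forward-compat: keep R384, add cuda-compat-10-0 (Tesla GPUs, Linux only)" ;;
        *)   echo "unknown: driver branch $1 is not covered by the slide" ;;
    esac
}

# Extract the branch (major version) from a full driver version string.
driver_branch() {
    echo "${1%%.*}"
}

# Example inputs; real values would come from the installed driver.
cuda10_upgrade_path "$(driver_branch 384.145)"
cuda10_upgrade_path "$(driver_branch 410.48)"
```

On a machine with the NVIDIA driver installed, the version string could be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `driver_branch`.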
17
BEST NVIDIA PERFORMANCE
Over 12 months, up to 1.8X improvement with mixed-precision on ResNet-50
18
BEST NVIDIA PERFORMANCE
2.0X improvement with mixed-precision on ResNet-50 from DGX-1 to DGX-2
19
TARGET SYSTEM SETUP
NGC Virtual Machine Images
NVIDIA Deep Learning for Volta (AWS–EC2 AMI)
Pre-installed: Up-to-date Ubuntu Server OS, CUDA Drivers, NVIDIA Container Runtime
NGC Container Ready BaseOS
On all DGX Systems
Self-Install Setup Guide
NGC Examples and Management Scripts
https://github.com/nvidia/ngc-examples
20
LOG INTO NGC, PULL AND RUN
1. Create Account / Log In
2. Get API Key
3. Browse For Image
4. Log in on Machine & Run
$ docker login nvcr.io
Username: $oauthtoken
Password: *******
$ docker run -it nvcr.io/nvidia/tensorflow:18.09
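For scripted use, the interactive login prompt can be replaced by a non-interactive one. The following is a sketch, assuming the key from the "Get API Key" step has been exported as an environment variable named `NGC_API_KEY`; that variable name and the `ngc_login` helper are assumptions, while `--username` and `--password-stdin` are standard `docker login` options.

```shell
#!/bin/sh
# Sketch of a scripted login to nvcr.io.
# NGC_API_KEY is an assumed environment variable holding the API key.
ngc_login() {
    if [ -z "${NGC_API_KEY:-}" ]; then
        echo "NGC_API_KEY is not set" >&2
        return 1
    fi
    # The username is the literal string $oauthtoken, so it is single-quoted
    # to stop the shell from expanding it as a variable.
    printf '%s' "$NGC_API_KEY" |
        docker login nvcr.io --username '$oauthtoken' --password-stdin
}
```

Calling `ngc_login` before `docker run` keeps the key out of the command line and the shell history, since it is passed on standard input.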
21
RUNNING CONTAINERS WITH DATA
nvcr.io/nvidia/tensorflow:18.02
/mnt/ssd/large_dataset (host) mounted at /workspace/large_dataset (container)
$ docker run --rm -it --volume /mnt/ssd/large_dataset:/workspace/large_dataset nvcr.io/…
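The volume mount can be wrapped in a helper that checks the host path before building the command. This is a minimal sketch: `run_with_data_cmd` is a hypothetical name, and the helper only prints the command so it can be reviewed before it is run; paths containing spaces would need extra quoting.

```shell
#!/bin/sh
# Hypothetical helper: print a docker run command that mounts a host dataset
# directory into the container. The image reference is supplied by the caller.
run_with_data_cmd() {
    host_dir=$1
    container_dir=$2
    image=$3
    if [ ! -d "$host_dir" ]; then
        echo "host directory not found: $host_dir" >&2
        return 1
    fi
    # Options go before the image name; anything after the image name is
    # passed to the container as its command.
    echo "docker run --rm -it --volume $host_dir:$container_dir $image"
}
```

For example, `run_with_data_cmd /mnt/ssd/large_dataset /workspace/large_dataset nvcr.io/nvidia/tensorflow:18.02` prints the full command if the host directory exists and reports an error otherwise.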
22
NVIDIA GPU CLOUD
GPU-Accelerated Containers for Deep Learning, HPC, and HPC Visualization
Innovate In Minutes, Not Weeks
Run Anywhere
Comprehensive Library of GPU-Accelerated Containers
23
Q & A