Transcript of GTC DC 2019 session DC91209.

Page 1

GTC DC 2019 - DC91209

No Data Center? No Problem! Supporting AI Workgroups with DGX Station and Kubernetes

Page 2

Michael Balint, Senior Product Manager, NVIDIA (@michaelbalint)

Markus Weber, Senior Product Manager, NVIDIA (@MarkusAtNVIDIA)

Page 3

Topics

➔ Intro to DGX Station

➔ Sharing Your GPU Compute Resource
  ● Basic
  ● Intermediate
  ● Advanced
  ● Futures

➔ Takeaways

Page 4

NVIDIA DGX STATION

Groundbreaking AI in Your Office

The AI Workstation for Data Science Teams


Key Features

1. 4 x NVIDIA Tesla V100 GPU (32GB)

2. 2nd-gen NVLink (4-way)

3. Water-cooled design

4. 3 x DisplayPort (4K resolution)

5. Intel Xeon E5-2698 20-core

6. 256GB DDR4 RAM


Pages 5-6

Deployment Scenarios

Pages 7-8

Today’s Focus

Page 9

Basic Sharing

Page 10

DGX STATION SOFTWARE STACK

Advantages:

Instant productivity with NVIDIA optimized deep learning frameworks

Caffe, Caffe2, PyTorch, TensorFlow, MXNet, and others

Performance optimized across the entire stack

Faster time-to-insight with pre-built, tested, and ready-to-run framework containers

Flexibility to use different versions of libraries such as libc and cuDNN in each framework container

Fully Integrated Software for Instant Productivity


Page 11

Using Individual GPUs

$ docker run -e NVIDIA_VISIBLE_DEVICES=0,1 --rm nvidia/cuda nvidia-smi
Thu Mar  7 23:34:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   36C    P0    37W / 300W |    432MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Page 12

$ docker run -e NVIDIA_VISIBLE_DEVICES=2,3 --rm nvidia/cuda nvidia-smi
Thu Mar  7 23:35:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   36C    P0    40W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
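One detail worth noting in the second run: the container always sees its GPUs enumerated from 0, regardless of which host indices were selected. A minimal shell sketch of that re-enumeration (illustrative only, not NVIDIA's implementation):

```shell
# NVIDIA_VISIBLE_DEVICES lists *host* GPU indices; inside the container the
# selected devices are renumbered 0..N-1. That is why host GPUs 2 and 3
# (bus IDs 0E/0F) report as GPU 0 and GPU 1 in the output above.
NVIDIA_VISIBLE_DEVICES="2,3"
new_idx=0
for host_idx in $(printf '%s' "$NVIDIA_VISIBLE_DEVICES" | tr ',' ' '); do
  echo "container GPU $new_idx -> host GPU $host_idx"
  new_idx=$((new_idx + 1))
done
```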


Page 14

NVIDIA GPU Cloud (NGC)

Page 15

Using Individual GPUs

Joe:

docker run -e NVIDIA_VISIBLE_DEVICES=0 --rm nvcr.io/nvidia/pytorch:19.02-py3 \
    python /workspace/examples/upstream/mnist/main.py

docker run -e NVIDIA_VISIBLE_DEVICES=1 --rm nvcr.io/nvidia/pytorch:18.11-py2 \
    python /workspace/examples/upstream/mnist/main.py

Jane:

docker run -it -e NVIDIA_VISIBLE_DEVICES=2,3 --rm \
    -v /home/jane/data/mnist:/data/mnist nvcr.io/nvidia/tensorflow:19.02-py3

Real-World Execution Examples


Page 17

“Manual” Sharing

Page 18

Using VNC

Page 19

Intermediate Sharing

Page 20

Data Storage
Internal RAID 0 | Internal RAID 5 | External DAS

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda       8:0    0  1.8T  0 disk
├─sda1    8:1    0  487M  0 part  /boot/efi
└─sda2    8:2    0  1.8T  0 part  /
sdb       8:16   0  1.8T  0 disk
└─md0     9:0    0  5.2T  0 raid0 /raid
sdc       8:32   0  1.8T  0 disk
└─md0     9:0    0  5.2T  0 raid0 /raid
sdd       8:48   0  1.8T  0 disk
└─md0     9:0    0  5.2T  0 raid0 /raid

$ sudo configure_raid_array.py -m raid5

$ sudo configure_raid_array.py -m raid0
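The configure_raid_array.py helper presumably drives standard Linux md RAID underneath. As an illustration only (an assumption about what the script wraps, not its actual contents), rebuilding the three data drives above as RAID 5 by hand would look roughly like:

```shell
# DESTRUCTIVE, illustrative sketch only -- configure_raid_array.py is the
# supported tool. Device names match the lsblk output above.
sudo umount /raid
sudo mdadm --stop /dev/md0
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /raid
```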

Page 21

DAS via eSATA

Page 22

DAS via USB 3.1 (Gen 2)

Page 23

Configuring an NFS Cache

(Diagram: the 3-drive, 5.2 TB /raid volume serves as a filesystem cache (FSC) for remote NFS shared storage mounted at /mnt, reached over high-speed Ethernet via the 10GBASE-T (RJ45) port; /boot lives on the 1.8 TB OS drive.)
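The FSC arrangement corresponds to the Linux FS-Cache facility. A minimal sketch of wiring it up with cachefilesd, assuming a hypothetical NFS server name (package manager and paths will vary):

```shell
# Back the cache with the RAID volume (edit /etc/cachefilesd.conf so its
# 'dir' entry points at a directory on /raid), then start the cache daemon.
sudo apt-get install -y cachefilesd
sudo systemctl enable --now cachefilesd

# Mount the remote share with the 'fsc' option so NFS reads are cached locally.
sudo mount -t nfs -o fsc nfs-server:/export/data /mnt
```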

Page 24

Advanced Sharing

Page 25

DEMO

Page 26

AI as a Service
Use Case #1: Interactive Session

1. User requests 2 GPUs for interactive session via browser

2. Cluster finds 2 free GPUs and spawns a container that taps into them

3. User is presented with interactive python notebook

(Diagram: the cluster’s GPU pool, with 2 GPUs allocated to the session.)
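With the Kubernetes GPU device plugin in place, the request in step 1 can be expressed as a resource limit on a pod. A minimal sketch; the image and command here are assumptions, not the actual demo manifest:

```yaml
# Hypothetical pod spec: a 2-GPU interactive Jupyter session.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-session
spec:
  containers:
  - name: notebook
    image: nvcr.io/nvidia/tensorflow:19.02-py3
    command: ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root"]
    ports:
    - containerPort: 8888
    resources:
      limits:
        nvidia.com/gpu: 2   # scheduler finds a node with 2 free GPUs
```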

Page 27

AI as a Service
Use Case #2: ML Workflow

1. User defines a pipeline in which each step uses a container, and submits it to the cluster

2. Cluster finds resources for each step of the pipeline, spawning the necessary containers and tapping into GPUs

3. Results are written to disk for the user to analyze

Pipeline: 1. Preprocess → 2. Train A / 3. Train B

(Diagram: a container for each step is scheduled onto the cluster’s GPU pool.)

NOTE: Hyperparameter optimization can use a pipeline as well, spawning containers for each operation.
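One concrete way to express the three-step pipeline above is an Argo Workflow (the engine underlying Kubeflow Pipelines). This is an illustrative sketch; the image, script names, and resource counts are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    steps:
    - - name: preprocess          # step 1 runs alone
        template: preprocess
    - - name: train-a             # steps 2 and 3 run in parallel
        template: train
      - name: train-b
        template: train
  - name: preprocess
    container:
      image: nvcr.io/nvidia/tensorflow:19.02-py3
      command: [python, /workspace/preprocess.py]   # hypothetical script
  - name: train
    container:
      image: nvcr.io/nvidia/tensorflow:19.02-py3
      command: [python, /workspace/train.py]        # hypothetical script
      resources:
        limits:
          nvidia.com/gpu: 1
```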

Page 28

AI as a Service
Use Case #3: Inference Server

1. User submits to the cluster a deployment with requirements, including redundancy, a model to be served, and a desired URL endpoint

3x replicas | TF model | Serve on port 8080

2. Cluster serves the model with a container spawned for each replica

(Diagram: three replica containers spread across the GPU pool.)

3. If one of the replicas goes down, a new container is automatically spawned to replace it, guaranteeing service

(Diagram: a failed replica, marked with an X, is automatically replaced.)
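The redundancy in steps 2 and 3 maps directly onto a Kubernetes Deployment, whose controller replaces failed replicas automatically. A hedged sketch; the serving image and port wiring are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3                      # the "3x replicas" requirement
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: server
        image: nvcr.io/nvidia/tensorrtserver:19.02-py3   # assumed serving image
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
```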

Page 29

Example Tech Stack With K8s

(Diagram: Jupyter sessions and NGC containers run via Kubeflow on Kubernetes, which schedules them across the cluster’s GPUs.)

• Jupyter Notebooks provide an interactive interface to the cluster

• NGC Containers include CUDA and encapsulate DL & ML frameworks that can be run as interactive sessions (w/ Jupyter), workflows (involving multiple containers), or model serving

• Kubeflow interfaces with Kubernetes, simplifying the process of creating deployments

• Kubernetes acts as the OS of the cluster, keeping track of hardware resources and scheduling as necessary

Interactive Session

ML Workflow

Inference Server

Page 30

Deployment Use Cases
Where can it be leveraged?

● Many users, many nodes: On-prem
● Many users, single node: On-prem
● Cloud bursting: Hybrid
● Edge/IoT: Multi-region
● Production Inferencing*

(Diagram: deployments range from DGX server clusters behind a cluster API, to a single DGX Station, to CSP-hosted DGX instances, to Jetson TX2 edge devices.)

Page 31

DeepOps

• Opinionated defaults, incorporating NVIDIA best practices

• Highly modular, organized into components that can be customized and installed ad hoc

• Open source and freely available, but requires some DevOps knowledge to customize and deploy

• GitHub: https://github.com/NVIDIA/deepops

• Installs latest DGX OS on compute nodes

• Manages firmware, drivers, and other software

• Deploys job scheduling (Kubernetes and/or Slurm)

• Provides logging and monitoring services

• Scripts for additional services (Kubeflow, Dask, etc)

Note: DeepOps can also be used to configure any NVIDIA GPU-Accelerated platform

For cluster deployment and management
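Getting started follows the usual clone-and-configure pattern from the repository above. Exact script and playbook names can vary by release, so treat this as a sketch:

```shell
git clone https://github.com/NVIDIA/deepops.git
cd deepops
./scripts/setup.sh          # installs Ansible and dependencies (name may differ)
# Edit the inventory under config/ to list your nodes, then deploy Kubernetes:
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
```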

Page 32

To Summarize

Basic                 Intermediate        Advanced
OS Users              Internal Storage    DeepOps
SSH                   External Storage    Kubernetes
Docker / Containers   NFS Cache           Scripts
NGC                                       Scheduling
Manual Scheduling                         Orchestration
VNC                                       Monitoring