
GTC DC 2019 - DC91209

No Data Center? No Problem! Supporting AI Workgroups with DGX Station and Kubernetes


Michael Balint, Senior Product Manager, NVIDIA (@michaelbalint)

Markus Weber, Senior Product Manager, NVIDIA (@MarkusAtNVIDIA)


Topics

➔ Intro to DGX Station

➔ Sharing Your GPU Compute Resource
   ● Basic
   ● Intermediate
   ● Advanced
   ● Futures

➔ Takeaways


NVIDIA DGX STATION

Groundbreaking AI in Your Office

The AI Workstation for Data Science Teams


Key Features

1. 4 x NVIDIA Tesla V100 GPU (32GB)

2. 2nd-gen NVLink (4-way)

3. Water-cooled design

4. 3 x DisplayPort (4K resolution)

5. Intel Xeon E5-2698 v4 (20 cores)

6. 256GB DDR4 RAM


Deployment Scenarios


Today’s Focus


Basic Sharing


DGX STATION SOFTWARE STACK

Fully Integrated Software for Instant Productivity

Advantages:

• Instant productivity with NVIDIA-optimized deep learning frameworks: Caffe, Caffe2, PyTorch, TensorFlow, MXNet, and others

• Performance optimized across the entire stack

• Faster time-to-insight with pre-built, tested, and ready-to-run framework containers

• Flexibility to use different versions of libraries like libc and cuDNN in each framework container
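These containers are distributed through NGC. As a quick sketch of the workflow (the tag shown is one of the releases used later in this talk), authenticating and pulling one looks like:

$ docker login nvcr.io      # username is the literal string $oauthtoken; password is your NGC API key
$ docker pull nvcr.io/nvidia/tensorflow:19.02-py3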


Using Individual GPUs

$ docker run -e NVIDIA_VISIBLE_DEVICES=0,1 --rm nvidia/cuda nvidia-smi

Thu Mar  7 23:34:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   36C    P0    37W / 300W |    432MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+


$ docker run -e NVIDIA_VISIBLE_DEVICES=2,3 --rm nvidia/cuda nvidia-smi

Thu Mar  7 23:35:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   36C    P0    38W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   36C    P0    40W / 300W |      0MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
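NVIDIA_VISIBLE_DEVICES is honored by the NVIDIA container runtime that ships with DGX OS. As an aside (an assumption about your setup, not something shown in the talk), on hosts running Docker 19.03+ with nvidia-container-toolkit, the --gpus flag achieves the same per-container GPU pinning:

$ docker run --gpus '"device=0,1"' --rm nvidia/cuda nvidia-smi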


NVIDIA GPU Cloud (NGC)


Using Individual GPUs
Real-World Execution Examples

Joe:

$ docker run -e NVIDIA_VISIBLE_DEVICES=0 --rm nvcr.io/nvidia/pytorch:19.02-py3 \
    python /workspace/examples/upstream/mnist/main.py

$ docker run -e NVIDIA_VISIBLE_DEVICES=1 --rm nvcr.io/nvidia/pytorch:18.11-py2 \
    python /workspace/examples/upstream/mnist/main.py

Jane:

$ docker run -it -e NVIDIA_VISIBLE_DEVICES=2,3 --rm \
    -v /home/jane/data/mnist:/data/mnist nvcr.io/nvidia/tensorflow:19.02-py3


“Manual” Sharing


Using VNC
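A common way to do this safely is to keep the VNC server bound to localhost and tunnel it over SSH. A minimal sketch, assuming x11vnc is installed on the station and that jane and dgx-station are placeholder names:

# On the DGX Station: export the current display, listening only on localhost
$ x11vnc -display :0 -localhost -rfbport 5901

# On the user's machine: forward local port 5901 to the station,
# then point any VNC viewer at localhost:5901
$ ssh -N -L 5901:localhost:5901 jane@dgx-station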


Intermediate Sharing


Data Storage
Internal RAID 0 | Internal RAID 5 | External DAS (eSATA or USB 3.1 Gen 2)

$ lsblk
NAME     MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda        8:0    0  1.8T  0 disk
├─sda1     8:1    0  487M  0 part  /boot/efi
└─sda2     8:2    0  1.8T  0 part  /
sdb        8:16   0  1.8T  0 disk
└─md0      9:0    0  5.2T  0 raid0 /raid
sdc        8:32   0  1.8T  0 disk
└─md0      9:0    0  5.2T  0 raid0 /raid
sdd        8:48   0  1.8T  0 disk
└─md0      9:0    0  5.2T  0 raid0 /raid

$ sudo configure_raid_array.py -m raid5

$ sudo configure_raid_array.py -m raid0
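configure_raid_array.py is the DGX utility for switching between the modes above; the resulting md array can be verified with standard Linux tools:

$ cat /proc/mdstat      # shows md0's RAID level, member disks, and any rebuild progress
$ df -h /raid           # confirms the array is mounted and shows usable capacity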


Configuring an NFS Cache

Diagram: remote NFS shared storage is mounted at /mnt over high-speed Ethernet (10GBASE-T, RJ45). The local /raid array (3 drives, 5.2 TB) is used for FSC (FS-Cache), separate from the 1.8 TB OS/boot drive.
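A minimal sketch of this setup, assuming a server and export named nfs-server:/export/data (hypothetical) and the cachefilesd package providing the FS-Cache backend on /raid:

# Point cachefilesd's cache directory at the RAID array and start it
$ sudo sed -i 's|^dir .*|dir /raid/fscache|' /etc/cachefilesd.conf
$ sudo systemctl enable --now cachefilesd

# Mount the share with the fsc option so repeated reads hit the local cache
$ sudo mount -t nfs -o fsc nfs-server:/export/data /mnt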


Advanced Sharing


DEMO


AI as a Service
Use Case #1: Interactive Session

1. User requests 2 GPUs for an interactive session via browser

2. Cluster finds 2 free GPUs and spawns a container which taps into them

3. User is presented with an interactive Python notebook

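Under the hood, that request might translate into a pod spec like the one below. This is a minimal sketch, not the demo's exact manifest: it assumes the NVIDIA device plugin is installed (so nvidia.com/gpu is a schedulable resource) and that the image bundles Jupyter; the pod name is hypothetical.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: jane-notebook                  # hypothetical session name
spec:
  containers:
  - name: notebook
    image: nvcr.io/nvidia/tensorflow:19.02-py3
    command: ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser"]
    ports:
    - containerPort: 8888              # Jupyter's default port
    resources:
      limits:
        nvidia.com/gpu: 2              # the 2 GPUs requested in step 1
EOF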


AI as a Service
Use Case #2: ML Workflow

1. User defines a pipeline in which each step uses a container, and submits it to the cluster

2. Cluster finds resources for each step of the pipeline, spawning the necessary containers and tapping into GPUs

3. Results are written to disk for the user to analyze

Pipeline steps: 1. Preprocess → 2. Train A → 3. Train B


NOTE: Hyperparameter optimization can use a pipeline as well, spawning containers for each operation.
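In the demo this is driven by Kubeflow, but the underlying pattern can be sketched with plain Kubernetes Jobs; preprocess.py and the step names below are hypothetical:

$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: preprocess                     # step 1 of the pipeline
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: preprocess
        image: nvcr.io/nvidia/pytorch:19.02-py3
        command: ["python", "/workspace/preprocess.py"]   # hypothetical script
        resources:
          limits:
            nvidia.com/gpu: 1
EOF

# Block until the step completes, then submit the train-a / train-b Jobs the same way
$ kubectl wait --for=condition=complete job/preprocess --timeout=2h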


AI as a Service
Use Case #3: Inference Server

1. User submits a deployment to the cluster with its requirements: redundancy, the model to be served, and a desired URL endpoint (e.g., 3x replicas, a TF model, served on port 8080)

2. Cluster serves the model with a container spawned for each replica

3. If one of the replicas goes down, a new container is automatically spawned to replace it, guaranteeing service

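A minimal sketch of such a deployment, assuming a hypothetical serving image (the talk does not name one) and the NVIDIA device plugin; the Deployment's ReplicaSet is what replaces failed replicas automatically in step 3:

$ cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-inference
spec:
  replicas: 3                          # redundancy from step 1
  selector:
    matchLabels:
      app: tf-inference
  template:
    metadata:
      labels:
        app: tf-inference
    spec:
      containers:
      - name: server
        image: registry.example.com/tf-model-server:latest   # hypothetical serving image
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: tf-inference                   # stable endpoint in front of the replicas
spec:
  selector:
    app: tf-inference
  ports:
  - port: 8080
    targetPort: 8080
EOF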


Example Tech Stack With K8s

Diagram: Jupyter runs inside NGC containers; Kubeflow deploys those containers onto Kubernetes; Kubernetes schedules them across the cluster's GPUs.

• Jupyter Notebooks provide an interactive interface to the cluster

• NGC Containers include CUDA and encapsulate DL & ML frameworks that can be run as interactive sessions (w/ Jupyter), workflows (involving multiple containers), or model serving

• Kubeflow interfaces with Kubernetes, simplifying the process of creating deployments

• Kubernetes acts as the OS of the cluster, keeping track of hardware resources and scheduling as necessary

The same stack backs all three patterns: Interactive Session, ML Workflow, Inference Server.
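To sanity-check that Kubernetes actually sees the GPUs as schedulable resources (assuming the NVIDIA device plugin is deployed, as DeepOps does below):

$ kubectl describe nodes | grep nvidia.com/gpu
# each DGX Station node should report nvidia.com/gpu: 4 under Capacity and Allocatable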


Deployment Use Cases
Where can it be leveraged?

• Many users, many nodes (on-prem)
• Many users, single node (on-prem)
• Cloud bursting (hybrid)
• Edge/IoT (multi-region)
• Production Inferencing*

Diagram: in each scenario a cluster API fronts the compute, whether that is racks of DGX systems, a single DGX Station, on-prem DGX systems bursting into a CSP, or DGX systems paired with fleets of Jetson TX2 devices at the edge.


DeepOps
For cluster deployment and management

• Opinionated defaults, incorporating NVIDIA best practices

• Highly modular, organized into components that can be customized and installed ad hoc

• Open source and freely available, but requires some DevOps knowledge to customize and deploy

• GitHub: https://github.com/NVIDIA/deepops

• Installs the latest DGX OS on compute nodes

• Manages firmware, drivers, and other software

• Deploys job scheduling (Kubernetes and/or Slurm)

• Provides logging and monitoring services

• Scripts for additional services (Kubeflow, Dask, etc.)

Note: DeepOps can also be used to configure any NVIDIA GPU-accelerated platform. A quickstart sketch follows.
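A rough quickstart following the pattern in the DeepOps README (script and playbook paths may differ between releases, so treat this as a sketch):

$ git clone https://github.com/NVIDIA/deepops.git
$ cd deepops
$ ./scripts/setup.sh                 # installs Ansible and other dependencies
# edit config/inventory to list your node(s), then stand up Kubernetes:
$ ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml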


To Summarize

Basic                 Intermediate        Advanced
OS Users              Internal Storage    DeepOps
SSH                   External Storage    Kubernetes
Docker / Containers   NFS Cache           Scripts
NGC                                       Scheduling
Manual Scheduling                         Orchestration
VNC                                       Monitoring