Deep Learning Workloads with NVIDIA GPUs on OpenShift
28 October, 2019
Mayur Shetty, Senior Solutions Architect, Red Hat
Mehnaz Mahbub, Cluster Systems Engineer, Supermicro Inc.
Agenda
● ML Pipeline and Key Personas
● Why Containers & Kubernetes in Hybrid Cloud for AI/ML workloads?
● Why OpenShift and Hybrid Cloud for ML workloads
● How to use GPUs with OpenShift
● Solution building blocks
● Cluster overview / network topology
● Benchmark suite
● Benchmark results
ML Pipeline & Key Personas

Pipeline stages: Data Acquisition & Preparation → ML Modelling (Selection, Training, Testing) → ML Model Deployment in the App. Dev. Process

Key personas along the pipeline: Data Engineer, Data Scientist, App Developer, IT Operations

Business Leadership provides the business objectives and data at the start of the pipeline, and receives intelligent applications that achieve business outcomes at the end.
Why Containers & Kubernetes in Hybrid Cloud for AI/ML workloads?
1. Agility across the ML pipeline
● Automated install and provisioning
● Autoscaling
● GPU acceleration, scaling, security, uptime

2. Portability & flexibility for ML-powered apps
● Develop/deploy ML apps across data center, edge, and public clouds
● Offer ML-as-a-service

3. Red Hat products & services help solve additional challenges
● Automation and CI/CD drive collaboration
● Boost productivity
● Data access, prep, & governance
● App lifecycle management & operations
Why OpenShift and Hybrid Platforms for ML Workloads?
[Platform diagram] OpenShift architecture: existing automation toolsets (SCM/Git, CI/CD) integrate with a service layer, persistent storage, and a registry; containers run on RHEL nodes, coordinated by master services (API/authentication, data store, scheduler, health/scaling) on Red Hat Enterprise Linux; the platform runs on physical, virtual, private, public, and hybrid infrastructure.

For the data scientist, this brings the best of the SDLC to ML in production:
● ML microservices scheduled and orchestrated on shared resources
● ML services load balanced and scaled
● ML deployed across clouds, data center, and edge
GPU as a service on OpenShift
Enablement of GPUs in an OpenShift Cluster
GPU node software components:
● CUDA driver (host or container)
● Kubernetes device plugin for GPU
● GPU node_exporter for Prometheus
● Node label: GPU
● CRI-O GPU runtime plugin
● Pre-reqs: install the NVIDIA driver for RHEL on the GPU host
● Add nvidia-container-runtime-hook and create the hook file
● Run the cuda-vector-add container to verify the driver and container enablement
● Configure OpenShift: the Device Plugin API is enabled by default
● Label the nodes with GPU
● Next, deploy the NVIDIA Device Plugin
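The enablement steps above can be sketched as shell commands. This is a minimal sketch, not the exact procedure from the deck: the node name, label key, test image tag, and device-plugin manifest filename are illustrative assumptions.

```shell
# Verify the NVIDIA driver is loaded on the GPU host (run on the node)
nvidia-smi

# Run a cuda-vector-add test container to verify the driver and the
# container runtime hook (image reference is an assumption; it should
# print "Test PASSED" when everything is wired up)
podman run --rm docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

# Label the GPU node so GPU workloads can target it
# (node name and label key are illustrative)
oc label node gpu-node-1 openshift.com/gpu-accelerator=true

# Deploy the NVIDIA device plugin, which advertises nvidia.com/gpu
# as a schedulable extended resource (manifest filename is illustrative)
oc create -f nvidia-device-plugin.yml

# Confirm the node now reports GPU capacity
oc describe node gpu-node-1 | grep nvidia.com/gpu
```

Once the device plugin is running, `nvidia.com/gpu` appears in the node's allocatable resources and pods can request it like CPU or memory.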
Deploying GPU Workloads onto OpenShift
GPU workloads can be scheduled either as a Pod deployment or as a Job.
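As a sketch of the Job path, the pod template requests GPUs through the device plugin's extended resource. The metadata name, image reference, and node label below are illustrative assumptions, not values from the deck:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mlperf-training            # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: benchmark
        image: quay.io/example/mlperf-benchmark:latest  # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 8      # request all 8 V100s on the GPU node
      nodeSelector:
        openshift.com/gpu-accelerator: "true"  # matches the GPU node label
```

Created with `oc create -f job.yml`, the scheduler will only place this pod on a node that advertises enough `nvidia.com/gpu` capacity.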
Preparing OpenShift for GPU benchmark workloads
● Containerize each of the MLPerf Training v0.6 benchmarks
  ○ Create a Dockerfile for the model with the MLCC tool from Red Hat
    ■ Add statements to the Dockerfile to build NVIDIA PyTorch from source
    ■ Add commands to run each MLPerf Training benchmark script
● Create a container image for each benchmark
● Push the images to Quay.io
● Deploy the MLPerf Training benchmarks, which require GPUs
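The containerization step might look like the following Dockerfile sketch. The base image tag, paths, and script name are assumptions for illustration; the actual Dockerfiles were generated with Red Hat's MLCC tool, and the PyTorch build is abbreviated here:

```dockerfile
# CUDA + cuDNN development base image (tag is illustrative)
FROM nvidia/cuda:10.0-cudnn7-devel-centos7

# Build NVIDIA PyTorch from source (statements added by MLCC; abbreviated)
RUN git clone --recursive https://github.com/pytorch/pytorch /opt/pytorch \
 && cd /opt/pytorch \
 && python3 setup.py install

# Copy one MLPerf Training benchmark's scripts into the image
COPY benchmark/ /workspace/benchmark/
WORKDIR /workspace/benchmark

# Run the benchmark when the container starts (script name is illustrative)
ENTRYPOINT ["./run_and_time.sh"]
```

The resulting image is then tagged and pushed to Quay.io so the cluster can pull it at deploy time.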
Deep Learning Benchmarks on Red Hat OpenShift using Supermicro SuperServers
Solution Reference Architecture
Software Stack Details
Solution Building Blocks
Hardware Setup
Ten-Node Cluster Overview
● 3 Master Nodes
● 3 Infra Nodes
● 1 Bastion/LB node
● 3 Application nodes
  - Includes a GPU node with 8 × NVIDIA® Tesla® V100 SXM2 GPUs
Network Topology
About MLPerf and Datasets
Benchmark categories: Object Detection and Machine Translation

MLPerf: https://mlperf.org/
COCO: http://cocodataset.org/#home
WMT: http://www.statmt.org/wmt14/translation-task.html
Benchmarking: Object Detection
Software
• RHEL 7.6
• OpenShift 3.11
• PyTorch 19.05
• CUDA 10.0, CUDA 9.2
• Python 3
MLPerf Training v0.6 Results
Benchmarking: Machine Translation
Recurrent & Non-Recurrent Translation Using GNMT & Transformer
MLPerf Training v0.6 Results
OpenShift GUI from the Project
Project Outcomes & Result Evaluation
Result Validation & Significance:
● First-ever MLPerf benchmark of Red Hat OpenShift
● Deep learning workloads running on OpenShift match, and in some cases beat, bare-metal performance
● Hardware advantage: customers get the same training performance at a much lower cost (better performance per dollar)
➔ GitLab: https://gitlab.com/opendatahub/gpu-performance-benchmarks
➔ Whitepaper: https://www.redhat.com/en/resources/supermicro-deep-learning-openshift-reference-architecture
➔ Supermicro OpenShift Solution: https://www.supermicro.com/en/solutions/red-hat-openshift
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
Thank You