Deep Learning Workloads with NVIDIA GPUs on OpenShift
28 October, 2019
Mayur Shetty, Senior Solutions Architect, Red Hat
Mehnaz Mahbub, Cluster Systems Engineer, Supermicro Inc.
Agenda
● ML Pipeline and Key Personas
● Why Containers & Kubernetes in Hybrid Cloud for AI/ML workloads?
● Why OpenShift and Hybrid Cloud for ML workloads
● How to use GPUs with OpenShift
● Solution building blocks
● Cluster overview / network topology
● Benchmark suite
● Benchmark results
ML Pipeline & Key Personas

Pipeline stages: Data Acquisition & Preparation → ML Modelling (Selection, Training, Testing) → ML Model Deployment in the App. Dev. Process

Key personas along the pipeline: Data Engineer, Data Scientist, App Developer, IT Operations

Business Leadership provides the business objectives and data at the start of the pipeline, and receives intelligent applications that achieve business outcomes at the end.
Why Containers & Kubernetes in Hybrid Cloud for AI/ML workloads?
1. Agility across the ML pipeline
● Automated install and provisioning
● Autoscaling
● GPU acceleration, scaling, security, uptime

2. Portability & flexibility for ML-powered apps
● Develop/deploy ML apps across data center, edge, and public clouds
● Offer ML-as-a-service

3. Red Hat products & services help solve additional challenges
● Automation and CI/CD drive collaboration
● Boost productivity
● Data access, prep, & governance
● App lifecycle management & operations
Why OpenShift and Hybrid Platforms for ML Workloads?
[Platform diagram] OpenShift architecture: existing automation toolsets (SCM/Git, CI/CD) integrate with a service layer, persistent storage, and a registry; containers run on RHEL nodes, coordinated by master services (API/authentication, data store, scheduler, health/scaling) on Red Hat Enterprise Linux; the platform runs on physical, virtual, private, public, and hybrid infrastructure.

For the data scientist, this brings the best of the SDLC to ML in production:
● ML microservices scheduled and orchestrated on shared resources
● ML services load balanced and scaled
● ML deployed across clouds, data center, and edge
GPU as a service on OpenShift
Enablement of GPUs in an OpenShift Cluster
GPU node software components:
● CUDA driver (host or container)
● Kubernetes device plugin for GPU
● GPU node_exporter for Prometheus
● Node label: GPU
● CRI-O GPU runtime plugin
● Pre-reqs: install the NVIDIA driver for RHEL on the GPU host
● Add nvidia-container-runtime-hook and create the hook file
● Run the cuda-vector-add container to verify the driver and container enablement
● Configure OpenShift: the Device Plugin API is enabled by default
● Label the nodes with GPU
● Next, deploy the NVIDIA Device Plugin
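The enablement steps above can be sketched as shell commands. This is a minimal sketch, not the exact procedure from the deck: the node name, label key, test image tag, and device-plugin manifest filename are illustrative assumptions.

```shell
# Verify the NVIDIA driver is loaded on the GPU host (run on the node)
nvidia-smi

# Run a cuda-vector-add test container to verify the driver and the
# container runtime hook (image reference is an assumption; it should
# print "Test PASSED" when everything is wired up)
podman run --rm docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

# Label the GPU node so GPU workloads can target it
# (node name and label key are illustrative)
oc label node gpu-node-1 openshift.com/gpu-accelerator=true

# Deploy the NVIDIA device plugin, which advertises nvidia.com/gpu
# as a schedulable extended resource (manifest filename is illustrative)
oc create -f nvidia-device-plugin.yml

# Confirm the node now reports GPU capacity
oc describe node gpu-node-1 | grep nvidia.com/gpu
```

Once the device plugin is running, `nvidia.com/gpu` appears in the node's allocatable resources and pods can request it like CPU or memory.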
Deploying GPU Workloads onto OpenShift
GPU workloads can be scheduled either as a Pod deployment or as a Job.
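As a sketch of the Job path, the pod template requests GPUs through the device plugin's extended resource. The metadata name, image reference, and node label below are illustrative assumptions, not values from the deck:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mlperf-training            # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: benchmark
        image: quay.io/example/mlperf-benchmark:latest  # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 8      # request all 8 V100s on the GPU node
      nodeSelector:
        openshift.com/gpu-accelerator: "true"  # matches the GPU node label
```

Created with `oc create -f job.yml`, the scheduler will only place this pod on a node that advertises enough `nvidia.com/gpu` capacity.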
Preparing OpenShift for GPU benchmark workloads
● Containerize each of the MLPerf Training v0.6 benchmarks
  ○ Create a Dockerfile for the model with the MLCC tool from Red Hat
    ■ Add statements to the Dockerfile to build NVIDIA PyTorch from source
    ■ Add commands to run each MLPerf Training benchmark script
● Create a container image for each benchmark
● Push the images to Quay.io
● Deploy the MLPerf Training benchmarks, which require GPUs
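The containerization step might look like the following Dockerfile sketch. The base image tag, paths, and script name are assumptions for illustration; the actual Dockerfiles were generated with Red Hat's MLCC tool, and the PyTorch build is abbreviated here:

```dockerfile
# CUDA + cuDNN development base image (tag is illustrative)
FROM nvidia/cuda:10.0-cudnn7-devel-centos7

# Build NVIDIA PyTorch from source (statements added by MLCC; abbreviated)
RUN git clone --recursive https://github.com/pytorch/pytorch /opt/pytorch \
 && cd /opt/pytorch \
 && python3 setup.py install

# Copy one MLPerf Training benchmark's scripts into the image
COPY benchmark/ /workspace/benchmark/
WORKDIR /workspace/benchmark

# Run the benchmark when the container starts (script name is illustrative)
ENTRYPOINT ["./run_and_time.sh"]
```

The resulting image is then tagged and pushed to Quay.io so the cluster can pull it at deploy time.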
Deep Learning Benchmarks on Red Hat OpenShift using Supermicro SuperServers
Solution Reference Architecture
Software Stack Details
Solution Building Blocks
Hardware Setup
Ten-Node Cluster Overview
● 3 Master Nodes
● 3 Infra Nodes
● 1 Bastion/LB node
● 3 Application nodes
  - Includes a GPU node with 8 × NVIDIA® Tesla® V100 SXM2 GPUs
Network Topology
About MLPerf and Datasets
Benchmark categories: Object Detection and Machine Translation

MLPerf: https://mlperf.org/
COCO: http://cocodataset.org/#home
WMT: http://www.statmt.org/wmt14/translation-task.html
Benchmarking: Object Detection
Software
• RHEL 7.6
• OpenShift 3.11
• PyTorch 19.05
• CUDA 10.0, CUDA 9.2
• Python 3
MLPerf Training v0.6 Results
Benchmarking: Machine Translation
Recurrent & Non-Recurrent Translation Using GNMT & Transformer
MLPerf Training v0.6 Results
OpenShift GUI from the Project
Project Outcomes & Result Evaluation
Result Validation & Significance:
● First-ever MLPerf benchmark of Red Hat OpenShift
● Deep learning workloads running on OpenShift match, and in some cases beat, bare-metal performance
● Hardware advantage: customers get the same training performance at a much lower cost (better performance per dollar)
➔ GitLab: https://gitlab.com/opendatahub/gpu-performance-benchmarks
➔ Whitepaper: https://www.redhat.com/en/resources/supermicro-deep-learning-openshift-reference-architecture
➔ Supermicro OpenShift Solution: https://www.supermicro.com/en/solutions/red-hat-openshift
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
Thank You