ROCm and Hopsworks for end-to-end deep learning pipelines
Jim Dowling, CEO, Logical Clocks
Ajit Mathews, CVP ML Software Engineering, AMD
17 October 2019, O’Reilly AI, London
Great Hedge of India
● The East India Company was one of the industrial world’s first monopolies.
● It assembled a thorny hedge (not a wall!) spanning India.
● You paid customs duty to bring salt over the wall (sorry, hedge).
In 2019, not all graphics cards are allowed to be used in a data center.
Monopolies are not good for deep learning!
[Image from Wikipedia]
Nvidia™ 2080Ti vs AMD Radeon™ VII: ResNet-50

Nvidia™ 2080Ti
Memory: 11 GB, TensorFlow 1.12
CUDA 10.0.130, cuDNN 7.4.1
Model: ResNet-50, Dataset: ImageNet (synthetic)
------------------------------------------------------------
FP32 total images/sec: ~322
FP16 total images/sec: ~560

AMD Radeon™ VII
Memory: 16 GB, TensorFlow 1.14
ROCm 2.7
Model: ResNet-50, Dataset: ImageNet (synthetic)
------------------------------------------------------------
FP32 total images/sec: ~316
FP16 total images/sec: ~421

References:
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
Hopsworks hides the Complexity of Deep Learning
● Data Collection
● Feature Engineering
● Data Validation
● Hyperparameter Tuning
● Distributed Training
● Model Serving
● A/B Testing
● Monitoring
● Pipeline Management
● Hardware Management
Data → Model φ(x) → Prediction
Hopsworks REST API
Hopsworks Feature Store
[Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”]
Data sources | Applications | API | Dashboards
Hopsworks
Environments: Apache Beam, Apache Spark, Pip, Conda
Frameworks: TensorFlow, scikit-learn, PyTorch
Tooling: Jupyter Notebooks, TensorBoard, Apache Beam, Apache Spark, Apache Flink, Kubernetes
Batch | Distributed ML & DL | Model Serving
Hopsworks Feature Store
Kafka + Spark Streaming
Model Monitoring
Orchestration in Airflow
Data Preparation & Ingestion | Experimentation & Model Training | Deploy & Productionalize | Streaming
Filesystem and metadata storage: HopsFS
Apache Kafka
Data sources
Scientific Apps | Machine Learning
TRANSLATE, WEATHER, EDUCATION, SEISMIC PROCESSING, SPEECH, RECOMMENDER, SIMULATIONS, MOLECULAR SIMULATIONS
Open Source Foundation for Machine Learning

Frameworks
Middleware and Libraries: MIOpen, RCCL, rocBLAS, rocFFT, rocPRIM, Eigen
Programming Models: OpenMP, HIP, OpenCL™, Python™
Runtimes & Drivers: ROCm Kernel Driver
Devices: GPU, CPU, APU, DL Accelerator

● Open source stack
● Latest ML frameworks
● Optimized compilers, math & communication libraries
● Write parallel GPU algorithms entirely from Python™
● Cluster scale: Docker™, SLURM and Kubernetes extensions
● Distro: upstream Linux kernel support (1100+ upstream ROCm driver commits since the 4.12 kernel)
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
● Languages: multiple programming options (OpenMP, HIP, OpenCL™, Python™)

Compilation: LLVM → AMDGCN compiler → AMDGPU code
LLVM: https://llvm.org/docs/AMDGPUUsage.html
Clang HIP: https://clang.llvm.org/doxygen/HIP_8h_source.html
Machine Learning Frameworks
● ROCm enablement upstreamed to mainline; latest version supported
● Support for major applications and benchmarks
● Support for fp32 and fp16 precisions
● Support for automatic mixed precision via rocPyX (ROCm PyTorch Extensions)
● Available today as a Docker™ container or build from source
https://medium.com/tensorflow/community-supported-amd-rocm-build-for-tensorflow-e8e9ac258369
● Available today as a Docker™ container or as a Python™ pip wheel
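The fp16 support and automatic mixed precision above matter because half precision halves memory traffic but keeps only about three decimal digits of precision, which is why mixed-precision training typically holds a master copy of the weights in fp32. A minimal NumPy sketch of the effect (illustrative only, not ROCm-specific):

```python
import numpy as np

# fp16 has a 10-bit mantissa: the spacing between representable
# numbers near 1.0 is ~0.001, so a small update can vanish entirely.
w16 = np.float16(1.0)
update = np.float16(1e-4)
assert w16 + update == w16            # the update is lost in fp16

# The same update applied to an fp32 "master weight" survives,
# which is the core idea behind mixed-precision training.
w32 = np.float32(1.0)
assert w32 + np.float32(1e-4) != w32
```

This is why frameworks compute forward/backward passes in fp16 for speed but accumulate weight updates in fp32.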
MIOpen: Open Source Machine Learning Library
https://github.com/ROCmSoftwarePlatform/MIOpen
https://arxiv.org/abs/1910.00078
Helps realize the benefits of high-performance, highly tuned deep learning primitives:
• Single precision (FP32), half precision (FP16), mixed precision and bfloat16 supported
• >500 operators accelerated; hand-tuned assembly operations

CONVOLUTION: Winograd, FFT, Direct GEMM, Implicit GEMM, Dilated Convolutions, Group Convolutions, Transpose Convolutions
RNNs: LSTMs, GRU
ACTIVATION: ReLU, Sigmoid, TANH, Leaky ReLU
POOLING: Max, Average
NORMALIZATION: LRN, Batch Normalization
Other: Dropout, Embeddings, SoftMax
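The convolution algorithms listed above (Winograd, FFT, direct GEMM, implicit GEMM) are different ways of computing the same operator. As a concept sketch, here is the classic im2col + GEMM formulation of 2D convolution in plain NumPy; the function name and shapes are my own, and real MIOpen kernels fuse these steps on the GPU rather than materializing the patch matrix:

```python
import numpy as np

def conv2d_gemm(x, w):
    """2D cross-correlation (valid padding, stride 1) via im2col + GEMM.
    x: input (H, W); w: kernel (KH, KW). Returns (H-KH+1, W-KW+1)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    # im2col: each output position becomes one row of a patch matrix.
    cols = np.array([x[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])
    # GEMM: one matrix product computes every output pixel at once.
    return (cols @ w.ravel()).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
out = conv2d_gemm(x, w)   # each output is the sum of a 3x3 patch
```

Casting convolution as GEMM is what lets libraries reuse highly tuned matrix-multiply kernels (rocBLAS here) for the bulk of the work.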
Support for Robust AI Compiler Technologies
XLA enabled on ROCm and upstreamed; performance improvements realized over the classic backend.
TensorFlow → MLIR → Stream Executor → MLIR MIOpen Dialect → MIOpen with “composable” convolutions
MLIR-aware MIOpen: MLIR MIOpen Dialect → LLVM IR + kernel execution
Eliminates the “black-box” nature of convolutions to drive next-gen performance optimizations.
MLIR-based applications: basic functionality enabled on ROCm; upstreaming in progress.
AMD is a launch partner of Google’s MLIR technology.
Learn more about MLIR on ROCm at TensorFlow World in 2 weeks.
ROCm Performance Improvements
>1.5× year-over-year software performance improvements on the same hardware.
[Chart: relative performance, Oct ’18 → Oct ’19. ResNet50_128: 100% → 157%; Inception3_128: 100% → 153%.]
ROCm is on a monthly release cadence packed with new features and performance optimizations, released both as source and as Docker containers.
All performance measured on a single MI25 GPU using fp32.
Intra-Node and Inter-Node Ready
[Chart: multi-GPU scaling, images/sec vs. GPU count. PyTorch, ResNext101, Radeon Instinct™ MI50, FP32, BS 32, 1 proc/GPU. Scaling efficiency: 100% (1 GPU), 96% (2), 94% (4), 92% (8), 92% (16), 91% (32), 84% (64).]
[Diagram: eight nodes with RDMA over a 100G switch; each node has 2 CPUs, InfiniBand NICs, and 8 Radeon Instinct™ MI50 (Vega 20) GPUs connected in a chordal ring with Infinity Fabric ~ all-to-all communication.]
[Chart: multi-GPU scaling, images/sec vs. GPU count. PyTorch, ResNet-50, Radeon Instinct™ MI50, FP32, BS 128, 1 proc/GPU. Scaling efficiency: 100% (1 GPU), 97% (2), 96% (4), 95% (8), 94% (16), 94% (32), 92% (64).]
RCCL: optimized collective communication operations library. Support for InfiniBand and RoCE high-speed network fabrics. Designed for easy MPI integration.
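RCCL's central collective, all-reduce, is often implemented as a ring algorithm: each of N ranks circulates chunks around a ring, so per-rank bandwidth stays roughly constant as N grows. A toy pure-Python simulation of the idea (a pedagogical sketch, not RCCL's actual implementation):

```python
def ring_allreduce(vectors):
    """Toy ring all-reduce: every rank ends with the elementwise sum.
    vectors: one equal-length list per simulated 'GPU'."""
    n, size = len(vectors), len(vectors[0])
    assert size % n == 0, "toy version: length divisible by rank count"
    c = size // n                                  # chunk length
    buf = [list(v) for v in vectors]
    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the
    # complete sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = []                                 # snapshot simultaneous sends
        for r in range(n):
            idx = (r - step) % n
            sends.append((idx, buf[r][idx * c:(idx + 1) * c]))
        for r in range(n):                         # rank r receives from r-1
            idx, data = sends[(r - 1) % n]
            for k in range(c):
                buf[r][idx * c + k] += data[k]
    # Phase 2: all-gather. Circulate the fully reduced chunks.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            idx = (r + 1 - step) % n
            sends.append((idx, buf[r][idx * c:(idx + 1) * c]))
        for r in range(n):
            idx, data = sends[(r - 1) % n]
            buf[r][idx * c:(idx + 1) * c] = data
    return buf

result = ring_allreduce([[1, 2], [3, 4]])   # both ranks end with [4, 6]
```

Each rank sends 2·(n-1)/n of the data regardless of n, which is why ring all-reduce scales well on the GPU counts shown in the charts above.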
Tools: System Management
ROCm Validation Suite (RVS): a cluster management tool for detecting and troubleshooting software and hardware configuration issues, basic diagnostics, integration issues and system performance.
https://github.com/ROCm-Developer-Tools/ROCmValidationSuite
ROCm System Management Interface (SMI): a system administrator’s tool for management and monitoring of GPU devices in a ROCm-enabled system.
https://github.com/RadeonOpenCompute/ROC-smi
Tools: Debugger and Performance Profiling
ROCm Debugger: a multi-architecture x86_64 CPU and amdgcn GPU debugger. Supports amdgcn ISA-level and x86_64 debugging via GDB. ROCm debugger APIs will be available for third-party debugger developers.
ROCm Performance Profiling: rocProfiler allows collection of GPU HW counters. rocTracer provides on-demand tracing with generic runtime Events, Callback and Activity APIs. rocTX provides code annotation markers to profile specific sections of code.
https://github.com/ROCm-Developer-Tools/rocprofiler
https://github.com/ROCm-Developer-Tools/roctracer
ROCm: Machine Learning Applications
Image Classification | Object Detection | Machine Translation | Recommendation Systems
All state-of-the-art models are enabled on ROCm using TensorFlow & PyTorch.
ResNext101 (fp32), batch=32: 63 img/sec
Mask-RCNN-Resnet101 (fp32), batch=2, img 1K×1K: 144 img/sec
BERT: 66 examples/sec
Neural Collaborative Filtering: 2,961,459 tokens/sec
Image classification: ResNet50, ResNet101, VGG, ResNext101, Inception3, Inception4, ShuffleNet, DenseNet, MobileNet, SqueezeNet, …
Object detection: Faster-RCNN-Resnet50, Mask-RCNN-Resnet50, SSD-Resnet50, …
Machine translation: GNMT (LSTMs), Translate (LSTMs), BERT (Transformer), GPT-2 (Transformer)
Recommendation: DLRM, NCF
All performance measured on a single MI50 GPU using fp32.
NEW ERA OF GPU-ACCELERATED MACHINE LEARNING
OPEN SOFTWARE PLATFORM FOR COMPUTE
Enabling Innovation, Collaboration, and Efficiency
Optimized for HPC and Deep Learning at Scale
Call to Action: Download the latest ROCm release and contribute
ROCm in Hopsworks ML Pipelines
Spark / TensorFlow Applications on ROCm
Goal: Spark/TensorFlow applications in Hopsworks should run unchanged on ROCm.
Solution: Hopsworks runs Spark/TensorFlow on YARN, with support for ROCm through CGroups.
YARN support for ROCm in Hopsworks
A Container is a CGroup that isolates CPU, memory, and GPU resources, and has a conda environment and TLS certs.
[Diagram: the Resource Manager schedules Containers on NodeManagers; each Container runs a Spark Executor or the Driver.]
Kubernetes support for Jupyter/ROCm in Hopsworks
[Diagram: Hopsworks launches Jupyter in a POD on a Kubernetes Worker via the KubeMaster, while distributed ML runs with Spark-on-YARN.]
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
End-to-End ML Pipelines
Raw Data / Event Data → Data Prep → Data Lake → Feature Store → Train & Validate → Model Serving → Monitor
Data-Intensive | GPUs | Scale In/Out
End-to-End ML Pipeline Technologies in Hopsworks
Raw Data / Event Data → Data Prep → Data Lake → Feature Store → Train & Validate → Model Serving → Monitor
Every stage orchestrated by Airflow
ML Pipelines of Jupyter Notebooks with Airflow
Dataprep pipeline: Select Features, File Format → Feature Engineering → Feature Store
Training and deployment pipeline: Feature Store → Experiment, Train Model → Validate & Deploy Model
Don’t just Throw Away Notebooks
Explore data, train model → Hyperparameter optimization → Distributed training
The usual alternative is to port/rewrite notebook code into Python + YML in Git, after which iteration is hard/impossible/a bad idea.
PySpark Notebooks as Jobs in ML Pipelines
Apache Airflow to Orchestrate ML Pipelines
Airflow → Jobs REST API → Hopsworks Jobs: PySpark, Spark, Flink, Beam/Flink
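An Airflow task triggers a Hopsworks job by calling the Jobs REST API. As a hedged sketch of that interaction, the helper below builds (but does not send) such a request; the URL path, header format, and payload fields here are my own illustrative assumptions, not the documented Hopsworks API schema:

```python
import json

def start_job_request(host, project_id, job_name, api_key):
    """Compose an HTTP request an Airflow task could use to launch a
    Hopsworks job. All paths/fields are hypothetical placeholders."""
    return {
        "method": "POST",
        "url": (f"https://{host}/hopsworks-api/api/project/{project_id}"
                f"/jobs/{job_name}/executions"),
        "headers": {
            "Authorization": f"ApiKey {api_key}",   # assumed auth scheme
            "Content-Type": "application/json",
        },
        "body": json.dumps({}),                     # job args would go here
    }

req = start_job_request("hopsworks.example.com", 42, "train_model", "TOKEN")
```

In a real DAG this payload would be handed to an HTTP operator or sent with a client library, with one task per pipeline stage (dataprep, train, deploy).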
Distributed Deep Learning with ROCm in Hopsworks
HParam Tuning with Spark+TensorFlow
[Diagram: Driver plus Executors 1…N, each with its own conda_env; HopsFS stores TensorBoard logs, models, checkpoints, and training data.]
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn(): # return dataset
    optimizer = …
    model = …
    model.add(Conv2D(…))
    model.compile(…)
    model.fit(…)
    model.evaluate(…)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
https://github.com/logicalclocks/hops-examples
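Conceptually, a grid search evaluates the train function once per point in the Cartesian product of the hyperparameter lists; the hops experiment API additionally fans these runs out across Spark executors. A plain-Python sketch of the expansion (my own minimal single-machine version, not the hops implementation):

```python
from itertools import product

def grid_search(train, hparams):
    """Run train(**point) for every grid combination and return the
    best (point, metric), assuming a higher metric is better."""
    keys = list(hparams)
    results = []
    for values in product(*(hparams[k] for k in keys)):
        point = dict(zip(keys, values))     # one hyperparameter combination
        results.append((point, train(**point)))
    return max(results, key=lambda r: r[1])

hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
# Toy stand-in for training: pretend the metric peaks at dropout=0.5.
best, metric = grid_search(lambda lr, dropout: lr - abs(dropout - 0.5),
                           hparams)
```

Here the 2×3 grid yields six independent trials, which is exactly what makes grid search embarrassingly parallel across executors.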
Distributed Training with Spark+TensorFlow
https://github.com/logicalclocks/hops-examples
# RUNS ON THE EXECUTORS
def train():
    def input_fn(): # return dataset
    model = …
    optimizer = …
    model.compile(…)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(…)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
[Diagram: Driver plus Executors 1…N, each with its own conda_env; HopsFS stores TensorBoard logs, models, checkpoints, and training data.]
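CollectiveAllReduceStrategy implements synchronous data parallelism: every worker computes gradients on its own data shard, the gradients are all-reduced (averaged), and every replica applies the identical averaged update, keeping the model copies in lock-step. A toy NumPy sketch of one such step for a linear model (function and shapes are my own, not the TensorFlow internals):

```python
import numpy as np

def sync_step(w, shards, lr=0.1):
    """One synchronous data-parallel SGD step for linear least squares.
    shards: [(X_i, y_i), ...], one per simulated worker."""
    grads = []
    for X, y in shards:                        # each worker uses its shard
        err = X @ w - y
        grads.append(2 * X.T @ err / len(y))   # local MSE gradient
    g = np.mean(grads, axis=0)                 # all-reduce: average gradients
    return w - lr * g                          # identical update everywhere

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])
shards = [(X[:4], y[:4]), (X[4:], y[4:])]      # two equal-size "workers"
w = sync_step(np.zeros(3), shards)
```

With equal-size shards, the averaged gradient equals the full-batch gradient, so the distributed step matches single-machine training exactly.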
Demo
End-to-End ML Workflow with Spark/TensorFlow and ROCm
That was Hopsworks with ROCm

Development & Operations
● Development Environment: first-class Python support
● Version Everything: code, infrastructure, data
● Model Serving on Kubernetes: TF Serving, SkLearn
● End-to-End ML Pipelines: orchestrated by Airflow

Security & Governance
● Secure Multi-Tenancy: project-based restricted access
● Encryption At-Rest, In-Motion: TLS/SSL everywhere
● AI-Asset Governance: models, experiments, data, GPUs
● Data/Model/Feature Lineage: discover/track dependencies

Efficiency & Performance
● Feature Store: data warehouse for ML
● Distributed Deep Learning: faster with more GPUs
● HopsFS: NVMe speed with Big Data
● Horizontally Scalable: ingestion, dataprep, training, serving
@logicalclocks
www.logicalclocks.com
OPEN SOURCE FOUNDATION FOR MACHINE LEARNING
Acknowledgements and References
Slides and Diagrams from colleagues:
● Maggy: Moritz Meister and Sina Sheikholeslami
● Feature Store: Kim Hammar
● Beam/Flink on Hopsworks: Theofilos Kakantousis
References
● HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017.
● Size matters: Improving the performance of small files …, ACM Middleware 2018.
● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
● Hopsworks Demo, SysML 2019.
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and
roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise
this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without
obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT
WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF
ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon and combinations thereof are trademarks of Advanced
Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their
respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD is intended or implied.