Tesla Master Deck · 2020-01-29


GRAND CHALLENGES REQUIRE MASSIVE COMPUTING

MEDICAL IMAGING | ASTROPHYSICS | NUCLEAR ENERGY | AUTONOMOUS DRIVING | FRAUD DETECTION | TELECOM

© 2018 VMware, Inc.

CHALLENGES UTILIZING AI & HPC SOFTWARE

INSTALLATION

- Complex, time-consuming, and error-prone

OPTIMIZATION

- Requires expertise to optimize framework performance

MAINTENANCE

- IT can’t keep up with frequent software upgrades

PRODUCTIVITY

- Users limited to older features and lower performance

EXPERTISE

- Building AI-centric solutions requires expertise


NGC – SIMPLIFYING AI & HPC WORKFLOWS

FASTER DEPLOYMENTS

- Eliminates installations. Simply Pull & Run the app

OPTIMIZED SOFTWARE

- Key DL frameworks updated monthly for performance optimization

ZERO MAINTENANCE

- Empowers users to deploy the latest versions with IT support

HIGHER PRODUCTIVITY

- Better insights and faster time-to-solution

EMBEDDING EXPERTISE

- Deliver greater value, faster


NGC: APPLICATIONS TO END-TO-END SOLUTIONS

[Diagram. Vertical axis: Workflow/Value Chain. Horizontal axis: Proficiency of Skills, from Advanced ML/DL Practitioner to Developers & Data Scientists]

Containers

Models | Model Scripts

- Classification | Object Detection | NLP/Translation | Text to Speech | Recommender

Industry Solutions

- Smart Cities | Medical Imaging

- Training SDK | Deployment SDK | KubeFlow Pipelines


NGC: GPU-OPTIMIZED SOFTWARE HUB

Simplifying DL, ML and HPC Workflows

50+ Containers: DL, ML, HPC

Pre-trained Models: NLP, Classification, Object Detection & more

Industry Workflows: Medical Imaging, Intelligent Video Analytics

Model Training Scripts: NLP, Image Classification, Object Detection & more

Innovate Faster

Deploy Anywhere

Simplify Deployments

NGC


CONTAINERS


CONTAINERS: SIMPLIFYING WORKFLOWS

Simplifies Deployments

- Eliminates complex, time-consuming builds and installs

Get started in minutes

- Simply Pull & Run the app

Portable

- Deploy across various environments, from test to production with minimal changes

WHY CONTAINERS


NGC CONTAINERS: ACCELERATING WORKFLOWS

Optimized for Performance

- Monthly DL container releases offer latest features and superior performance on NVIDIA GPUs

Scalable Performance

- Supports multi-GPU & multi-node systems for scale-up & scale-out environments

Designed for Enterprise & HPC environments

- Supports Docker & Singularity runtimes

Run Anywhere

- Pascal/Volta/Turing-powered NVIDIA DGX, PCs, workstations, and servers

- From Core to the Edge

- On-Prem to Hybrid to Cloud

WHY NGC CONTAINERS
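
The "Pull & Run" step on the container slides can be sketched as the two Docker commands it stands for, built here as argument lists. The image tag `nvcr.io/nvidia/pytorch:19.02-py3`, the `--gpus all` flag (Docker 19.03 or later with the NVIDIA container toolkit), the mount path, and the helper names are all assumptions for illustration; none of them appear in the slides.

```python
import shlex

NGC_REGISTRY = "nvcr.io"  # NGC container registry host


def image_ref(repo, tag):
    """Build a full image reference such as nvcr.io/nvidia/pytorch:19.02-py3."""
    return f"{NGC_REGISTRY}/{repo}:{tag}"


def pull_command(image):
    """Command that downloads the container image."""
    return ["docker", "pull", image]


def run_command(image, workdir=None):
    """Command that starts the container interactively with GPU access."""
    cmd = ["docker", "run", "--gpus", "all", "--rm", "-it"]
    if workdir is not None:
        # Hypothetical mount point: expose a host directory inside the container.
        cmd += ["-v", f"{workdir}:/workspace/data"]
    cmd.append(image)
    return cmd


# Assumed example image; substitute any tag listed in the NGC catalog.
image = image_ref("nvidia/pytorch", "19.02-py3")
for cmd in (pull_command(image), run_command(image, workdir="/data")):
    print(shlex.join(cmd))
```

On a host with Docker and an NVIDIA driver installed, each list could be passed to `subprocess.run(cmd, check=True)`; the sketch only prints the commands so it stays runnable anywhere.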


DALI

CPU Bottlenecks Waste GPU Cycles

- Complex I/O pipelines

- Multi-pipeline frameworks

- Decreasing CPU:GPU ratio

Eliminating CPU Bottleneck for DL Workflows

DALI Shifts Workloads to GPUs

- Full input pipeline acceleration including data loading and augmentation

- Integrated in PyTorch, TensorFlow, MXNet

- Supports ResNet-50 & SSD

[Diagram: input pipeline stages Loader -> Decode -> Resize -> Augment -> Training, shown twice. Caption: Image classification (ResNet-50)]
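
The bottleneck argument on the DALI slide can be illustrated with a back-of-envelope model: training consumes images only as fast as the slower of the input pipeline and the GPU. The rates below are hypothetical numbers chosen for illustration, not measurements from this deck, and the function name is an assumption. The model also ignores the GPU time that offloaded preprocessing itself consumes.

```python
def gpu_utilization(input_rate, gpu_rate):
    """Fraction of GPU capacity used when the input pipeline delivers
    `input_rate` images/s and the GPU could train on `gpu_rate` images/s."""
    if input_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    # End-to-end throughput is limited by the slower stage.
    throughput = min(input_rate, gpu_rate)
    return throughput / gpu_rate


# Hypothetical rates in images/s, for illustration only.
GPU_RATE = 8000.0            # what the GPUs could consume during training
CPU_PIPELINE_RATE = 2000.0   # load/decode/resize/augment on CPU cores
GPU_PIPELINE_RATE = 10000.0  # the same stages offloaded to the GPU

cpu_bound = gpu_utilization(CPU_PIPELINE_RATE, GPU_RATE)
offloaded = gpu_utilization(GPU_PIPELINE_RATE, GPU_RATE)
print(f"CPU input pipeline: {cpu_bound:.0%} of GPU capacity used")
print(f"GPU input pipeline: {offloaded:.0%} of GPU capacity used")
```

With these assumed rates, the CPU-bound pipeline leaves three quarters of the GPU capacity idle, which is the "wasted GPU cycles" case the slide describes.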


TENSOR CORES BUILT FOR AI AND HPC

Mixed Precision Accelerator – Delivering Up To 5X Throughput of FP32 [1]

FP32: Higher Precision. Range: +/- 3.402823 x 10^38

FP16: Reduced Precision, Higher Performance. Range: +/- 65,504

4x4 Product and Accumulate: D = A*B + C

FP32 = FP16 x FP16 + FP32

Inputs: two 4x4 matrices, 16 FP16 values each

Memory Savings

• Half Storage Requirements (larger batch size)

• Half the memory traffic by reducing size of gradient/activation tensors

[1] Fastest Tensor Core Speedup by Facebook on NMT (arXiv paper, Sep 2018): https://arxiv.org/pdf/1806.00187.pdf
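
The product-and-accumulate rule D = A*B + C with FP16 inputs and FP32 accumulation can be emulated in plain Python as a rough sketch. It uses the standard library's `struct` half-precision (`e`) and single-precision (`f`) formats to round values; the helper names are assumptions, and the rounding order is a software approximation rather than a description of the actual hardware.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D = A*B + C: A and B are rounded to FP16, C and the
    running sum are kept in FP32."""
    d = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = to_fp32(c[i][j])
            for k in range(4):
                # The product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = mma_4x4(identity, b, c)
print(d[0])

# Why accumulate in FP32: near 4096 the FP16 grid spacing is 4,
# so adding 1 is lost in FP16 but kept in FP32.
print(to_fp16(4096.0 + 1.0), to_fp32(4096.0 + 1.0))
```

The last line shows the motivation for the wider accumulator: a small contribution added to a large running sum vanishes at FP16 resolution but survives in FP32.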


MIXED PRECISION MAINTAINS ACCURACY

Benefit From Higher Throughput Without Compromise

[Bar chart: model accuracy, FP32 vs. mixed precision**, for AlexNet, VGG-D, GoogleNet (Inception v1), Inception v2, Inception v3, and ResNet-50]

ILSVRC12 classification top-1 accuracy. (Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training", ICLR 2018)

**Same hyperparameters and learning rate schedule as FP32.


CONTINUOUS PERFORMANCE IMPROVEMENT

Developers’ Software Optimizations Deliver Better Performance on the Same Hardware

Monthly DL Framework Updates & HPC Software Stack Optimizations Drive Performance

[Chart: MXNet, images/second by container release (18.02, 18.09, 19.02). Mixed Precision | 128 Batch Size | ResNet-50 Training | 8x V100]

[Chart: PyTorch, tokens/second by container release (18.05, 18.09, 19.02). Mixed Precision | 128 Batch Size | GNMT | 8x V100]

[Chart: TensorFlow, images/second by container release (18.02, 18.09, 19.02). Mixed Precision | 256 Batch Size | ResNet-50 Training | 8x V100]

[Chart: HPC Applications, speedup for Mar '18, Nov '18, Mar '19. Speedup across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, RTM | 4x V100 v. Dual-Skylake | CUDA 9 for Mar '18 & Nov '18, CUDA 10 for Mar '19]


MODEL REGISTRY & MODEL SCRIPTS


THE NGC MODEL REGISTRY

Repository of Popular AI Models

Starting point to retrain, prototype or benchmark against your own models

Use as-is or easily customize

Private hosted registry for NGC Enterprise accounts to upload, share and version


DOMAIN SPECIFIC | INFERENCE-READY

PRE-TRAINED MODELS

Domain specific for video analytics and medical imaging

Use transfer learning and your own data to quickly create accurate AI

Available models: organ & tumor segmentation, x-ray classification, classification and object detection for video analytics

TENSORRT MODELS

Ready for inference with Tensor Cores

Precision: INT8, FP16, FP32

Optimized for multiple GPU architectures

Available models: ResNet-50, VGG16, InceptionV1, MobileNet


LEARN | BUILD | OPTIMIZE | DEPLOY

MODEL SCRIPTS

Best practices for training models

Faster Performance with Optimized Libraries and Tensor Cores

State-of-the-Art Accuracy

Image Recognition: ResNet-50

Natural Language Processing: BERT, Transformer

Object Detection: SSD

Recommendation: Neural Collaborative Filtering

Segmentation: Mask R-CNN, UNET-Industrial

Speech: Tacotron, WaveGlow

Translation: GNMT


INDUSTRY SOLUTIONS


END-TO-END DEEP LEARNING WORKFLOW

Pre-Trained Models | Training & Adaptation | Ready to Integrate

Accelerate time to market

REMOVE PRODUCT NAMES

[Workflow diagram labels: PRETRAINED MODEL | TRAIN WITH NEW DATA | OPTIMIZE | RETRAIN | OUTPUT MODEL | INFER ON INPUT DATA | ANALYZE DATA. DISPLAY RESULTS]


NVIDIA METROPOLIS

Intelligent Video Analytics for Smart Cities

Transfer Learning Toolkit

- PRE-TRAINED MODELS | Data Converters | TRAIN | PRUNE | TUNE

- SAMPLE TRAINING PIPELINES

DeepStream SDK

- PERCEPTION GRAPH | Decoding | Processing | AI Inference | Tracking | Calibration | Analytics | Visualize

- REFERENCE APPLICATIONS

[Workflow diagram labels: VIDEO FEED | QUERY | Frames detecting new class of objects | RETRAIN WITH NEW DATA]


NVIDIA CLARA AI PLATFORM

Organ Segmentation for Medical Imaging

Clara Train SDK

- PRE-TRAINED MODELS | TRANSFER LEARNING | AI-ASSISTED ANNOTATION | DICOM 2 NIFTI | TUNE

- TRAINING PIPELINES

Clara Deploy SDK

- TRT INFERENCE SERVER | PIPELINE MANAGER | STREAMING RENDER | DICOM ADAPTER | WEBUI

- DEPLOYMENT PIPELINES

[Workflow diagram labels: CT SCANS OF PATIENT’S LIVER | SEGMENTED LIVER | RETRAIN WITH NEW DATA]