NGC
2
GRAND CHALLENGES REQUIRE MASSIVE COMPUTING
MEDICAL IMAGINGASTROPHYSICS NUCLEAR ENERGYAUTONOMOUS DRIVING FRAUD DETECTION TELECOM
3© 2018 VMware, Inc.
OPTIMIZATIONINSTALLATION
Complex, time
consuming, and error-
prone
Requires expertise to
optimize framework
performance
IT can’t keep up with
frequent software
upgrades
Users limited to older
features and lower
performance
CHALLENGES UTILIZING AI & HPC SOFTWARE
MAINTAINENCEPRODUCTIVITYEXPERTISE
Building AI-centric
solutions requires
expertise
4© 2018 VMware, Inc.
OPTIMIZED SOFTWARE
FASTER DEPLOYMENTS
Eliminates installations.
Simply Pull & Run the
app
Key DL frameworks
updated monthly for perf
optimization
Empowers users to
deploy the latest versions
with IT support
Better Insights and faster
time-to-solution
NGC – SIMPLIFYING AI & HPC WORKFLOWS
ZERO MAINTENANCE
HIGHER PRODUCTIVITY
EMBEDDING EXPERTISE
Deliver greater value,
faster
5© 2018 VMware, Inc.
NGC: APPLICATIONS TO END-TO-END SOLUTIONSW
ork
flow
/Valu
e C
hain
Containers Containers Models
Model Scripts
Classification | Object Detection | NLP/Translation
Text to Speech | Recommender
Industry Solutions
Smart Cities | Medical Imaging
Training SDK Deployment SDKKubeFlow
Pipelines
Proficiency of Skills
Advanced ML/DL Practitioner Developers & Data Scientists
6
NGC: GPU-OPTIMIZED SOFTWARE HUBSimplifying DL, ML and HPC Workflows
50+ ContainersDL, ML, HPC
Pre-trained ModelsNLP, Classification, Object Detection & more
Industry WorkflowsMedical Imaging, Intelligent Video Analytics
Model Training ScriptsNLP, Image Classification, Object Detection & more
Innovate Faster
Deploy Anywhere
Simplify Deployments
NGC
7
CONTAINERS
8
CONTAINERS: SIMPLIFYING WORKFLOWS
Simplifies Deployments
- Eliminates complex, time-consuming builds and installs
Get started in minutes
- Simply Pull & Run the app
Portable
- Deploy across various environments, from test to production with minimal changes
WHY CONTAINERS
9
NGC CONTAINERS: ACCELERATING WORKFLOWS
Simplifies Deployments
- Eliminates complex, time-consuming builds and installs
Get started in minutes
- Simply Pull & Run the app
Portable
- Deploy across various environments, from test to production with minimal changes
WHY CONTAINERS
Optimized for Performance- Monthly DL container releases offer latest features and
superior performance on NVIDIA GPUs
Scalable Performance
- Supports multi-GPU & multi-node systems for scale-up & scale-out environments
Designed for Enterprise & HPC environments
- Supports Docker & Singularity runtimes
Run Anywhere
- Pascal/Volta/Turing-powered NVIDIA DGX, PCs, workstations, and servers
- From Core to the Edge
- On-Prem to Hybrid to Cloud
WHY NGC CONTAINERS
10
GPU-OPTIMIZED SOFTWARE CONTAINERSOver 50 Containers on NGC
DEEP LEARNING MACHINE LEARNING
HPC VISUALIZATION
INFERENCE
GENOMICS
NAMD | GROMACS | more
RAPIDS | H2O | more TensorRT | DeepStream | more
Parabricks ParaView | IndeX | more
TensorFlow | PyTorch | more
11
DALI
CPU Bottleneck Waste GPU Cycles
- Complex I/O pipelines
- Multi-pipeline frameworks
- Decreasing CPU:GPU ratio
Eliminating CPU Bottleneck for DL Workflows
DALI Shifts Workloads to GPUs
- Full input pipeline acceleration including data loading and augmentation
- Integrated in PyTorch, TF, MxNET
- Supports Resnet50 & SSD
Loader Decode Resize Augment Training Loader Decode Resize Augment Training
Image classification (ResNet-50)
12
FP32Higher Precision
Range: +/- 3.402823x1038
TENSOR CORES BUILT FOR AI AND HPCMixed Precision Accelerator – Delivering Up To 5X Throughput of FP321
4x4 Product and Accumulate
FP32 = FP16 x FP16 + FP32
FP16Reduced Precision
Higher PerformanceRange: +/- 65,504
4x4 Matrix16 FP16 values
4x4 Matrix16 FP16 values
5X Throughput of FP321
D=A*B+C
1Fastest Tensor Core Speedup by Facebook on NMT (Arxiv paper Sep 2018)
https://arxiv.org/pdf/1806.00187.pdf
Memory Savings
• Half Storage Requirements (larger batch size)
• Half the memory traffic by reducing size of gradient/activation tensors
13
MIXED PRECISION MAINTAINS ACCURACYBenefit From Higher Throughput Without Compromise
ILSVRC12 classification top-1 accuracy.(Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training“, ICLR 2018)
**Same hyperparameters and learning rate schedule as FP32.
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
AlexNet VGG-D GoogleNet(Inception v1)
Inception v2 Inception v3 Resnet50
Model Accura
cy
FP32 Mixed Precision**
14
CONTINUOUS PERFORMANCE IMPROVEMENTDevelopers’ Software Optimizations Deliver Better Performance on the Same Hardware
Monthly DL Framework Updates & HPC Software Stack Optimizations Drive Performance
0
2000
4000
6000
8000
10000
12000
18.02 18.09 19.02
Imag
es/S
eco
nd
MxNet
Mixed Precision | 128 Batch Size | ResNet-50 Training | 8x V100
0
50000
100000
150000
200000
250000
300000
350000
400000
18.05 18.09 19.02
Toke
ns/
Seco
nd
PyTorch
0
1000
2000
3000
4000
5000
6000
7000
8000
18.02 18.09 19.02Im
ages
/Sec
on
d
TensorFlow
Mixed Precision | 128 Batch Size | GNMT | 8x V100 Mixed Precision | 256 Batch Size | ResNet-50 Training | 8x V100
Speedup across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, RTM | 4x V100 v. Dual-Skylake | CUDA 9 for Mar '18 & Nov '18, CUDA 10 for Mar '19
x
2x
4x
6x
8x
10x
12x
14x
16x
18x
Mar '18 Nov '18 Mar '19
HPC Applications
15
MODEL REGISTRY & MODEL SCRIPTS
16
THE NGC MODEL REGISTRY
Repository of Popular AI Models
Starting point to retrain, prototype or
benchmark against your own models
Use As-Is or easily customize
Private hosted registry for NGC Enterprise
accounts to upload, share and version
17
DOMAIN SPECIFIC | INFERENCE-READY
PRE-TRAINED MODELS
Domain specific for video analytics and
medical imaging
Use transfer learning and your own
data to quickly create accurate AI
Available models: Organ & tumor
segmentation, x-ray classification,
classification and object detection for
video analytics
TENSORRT MODELS
Ready for inference with Tensor Cores
Precision: INT8, FP16, FP32
Optimized for multiple GPU architectures
Available Models: ResNet50, VGG16, InceptionV1, Mobilenet
18
19
LEARN | BUILD | OPTIMIZE | DEPLOY
MODEL SCRIPTS
Best practices for training models
Faster Performance with Optimized
Libraries and Tensor Cores
State-of-the-Art Accuracy
Image Recognition ResNet-50
Natural Language Processing Bert, Transformer
Object Detection SSD
Recommendation Neural Collaborative Filtering
Segmentation Mask R-CNN, UNET-Industrial
Speech Tacotron, WaveGlow
Translation GNMT
20
INDUSTRY SOLUTIONS
21
END-TO-END DEEP LEARNING WORKFLOWPre-Trained Models | Training & Adaptation | Ready to Integrate
Accelerate time to market
REMOVE PRODUCT NAMES
INFER ON INPUT DATA
ANALYZE DATA.DISPLAY RESULTS
OUTPUT MODEL
RETRAINOPTIMIZETRAIN WITH NEW DATA
PRETRAINED MODEL
22
Transfer Learning Toolkit
PRE-TRAINED MODELS
PRUNETRAINData
Converters
SAMPLE TRAINING PIPELINES
TUNE
AI
Inference
PERCEPTION GRAPH
TrackingCalibration
Processing
REFERENE APPLICATIONS
Analytics
Visualize
DeepStream SDK
Decoding
NVIDIA METROPOLISIntelligent Video Analytics for Smart Cities
RETRAIN WITH NEW DATA`
QUERY
VIDEO FEED
Frames detecting new class of objects
23
Clara Train SDK
PRE-TRAINED MODELS
TRANSFER
LEARNINGAI-ASSISTED
ANNOTATIONDICOM 2
NIFTI
TRAINING PIPELINES
TUNE
TRT INFERENCE
SERVER
PIPELINE MANAGER
STREAMING RENDER
DICOM
ADAPTERDEPLOYMENT PIPELINES WEBUI
Clara Deploy SDK
NVIDIA CLARA AI PLATFORMOrgan Segmentation for Medical Imaging
RETRAIN WITH NEW DATA
CT SCANS OF PATIENT’S LIVER
SEGMENTED LIVER
Top Related