Transcript: Deploying Deep Neural Networks to Embedded GPUs and CPUs (MATLAB Coder and GPU Coder)
1
© 2015 The MathWorks, Inc.
Deploying Deep Neural Networks
to Embedded GPUs and CPUs
Steven Thomsett
2
Deep Learning Workflow in MATLAB
[Workflow diagram: Deep Neural Network Design + Training → Trained DNN → Application Design (application logic) → Standalone Deployment]
3
Deep Neural Network Design and Training
▪ Design in MATLAB
  ▪ Manage large data sets
  ▪ Automate data labeling
  ▪ Easy access to models
▪ Training in MATLAB
  ▪ Acceleration with GPUs
  ▪ Scale to clusters
[Diagram: reference models, model importers, and transfer learning feed "Train in MATLAB", which produces the trained DNN (a minimal transfer-learning sketch follows).]
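For reference, a minimal transfer-learning sketch along these lines, assuming Deep Learning Toolbox and the ResNet-50 support package are installed; the five-class task, the new layer names, and the imds datastore are illustrative placeholders, not from the slides:

% Start from a pretrained reference model and retrain it for a new 5-class task.
net = resnet50;                                  % pretrained reference model
lgraph = layerGraph(net);
lgraph = replaceLayer(lgraph, 'fc1000', fullyConnectedLayer(5, 'Name', 'fcNew'));
lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', classificationLayer('Name', 'outputNew'));
opts = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'gpu', ...           % GPU-accelerated training
    'InitialLearnRate', 1e-4);
% imds is a labeled imageDatastore of the new training images (placeholder):
% trainedNet = trainNetwork(imds, lgraph, opts);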
4
Application Design
[Diagram: Pre-processing → Trained DNN → Post-processing]
5
Multi-Platform Deep Learning Deployment
[Diagram: application logic built around the trained DNN]
6
Multi-Platform Deep Learning Deployment
[Diagram: the same application logic deployed across platforms: Embedded (NVIDIA Jetson, Raspberry Pi, BeagleBone), Mobile, Desktop, Data Center, …]
7
Algorithm Design to Embedded Deployment Workflow
Conventional Approach
1. Functional test (desktop GPU): high-level language, deep learning framework, large and complex software stack
2. Deployment unit-test (desktop GPU, C++): low-level APIs, application-specific libraries
3. Deployment integration-test (embedded GPU, C++)
4. Real-time test (embedded GPU): C/C++, target-optimized libraries, optimized for memory and speed
Challenges
• Integrating multiple libraries and packages
• Verifying and maintaining multiple implementations
• Algorithm and vendor lock-in
8
Solution: Use MATLAB Coder & GPU Coder for
Deep Learning Deployment
[Diagram: the application logic and trained DNN feed into the coders. GPU Coder generates code that calls the NVIDIA TensorRT and cuDNN libraries; MATLAB Coder generates code that calls the Intel MKL-DNN library and the ARM Compute Library.]
10
Deep Learning Deployment Workflows
INTEGRATED APPLICATION DEPLOYMENT: pre-processing + trained DNN + post-processing → codegen → portable target code
INFERENCE ENGINE DEPLOYMENT: trained DNN → cnncodegen → portable target code
12
Workflow for Inference Engine Deployment
Steps for inference engine deployment (a MATLAB-side sketch follows below):
1. Generate code for the trained model: >> cnncodegen(net, 'targetlib', 'arm-compute')
2. Copy the generated code onto the target board
3. Build the code for the inference engine: >> make -C ./codegen -f …mk
4. Write a hand-written main function that calls the inference engine
5. Build the executable and test it: >> make -C ./ ……
[Diagram: trained DNN → cnncodegen → portable target code (INFERENCE ENGINE DEPLOYMENT)]
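A minimal MATLAB-side sketch of step 1 above, assuming the ARM Compute Library and the matching deep learning support packages are set up on the host; the pretrained squeezenet is only a stand-in for any trained network:

% Load (or train) a network, then generate portable code for its inference engine.
net = squeezenet;                            % stand-in; use your own trained SeriesNetwork/DAGNetwork
cnncodegen(net, 'targetlib', 'arm-compute'); % emits C++ code that calls the ARM Compute Library
% The code lands under ./codegen. Copy that folder to the target board,
% build it with the generated makefile (step 3), add a hand-written main()
% that calls the generated predict entry point (step 4), then build and run
% the executable on the board (step 5).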
14
Deep Learning Inference Deployment
[Same coder/library diagram as slide 8.]
Example application: Pedestrian Detection
15
Deep Learning Inference Deployment
[Same coder/library diagram.]
Example application: Blood Smear Segmentation
16
Deep Learning Inference Deployment
[Same coder/library diagram.]
Example application: Defect Classification & Detection
18
[Same coder/library diagram.]
How is the performance?
19
Performance of Generated Code
▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Titan V GPU
▪ CNN inference (ResNet-50) on Jetson TX2
▪ CNN inference (ResNet-50, VGG-16, Inception V3) on Intel Xeon CPU
20
Single Image Inference on Titan V using cuDNN
[Bar chart comparing single-image inference performance on a Titan V: PyTorch (1.0.0), MXNet (1.4.0), GPU Coder (R2019a), TensorFlow (1.13.0).]
Benchmark setup: Intel® Xeon® CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7; frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0.
21
Even Stronger Performance with INT8 using TensorRT
[Bar chart: ResNet-50 inference on Titan V versus batch size, comparing GPU Coder + TensorRT (FP32), TensorFlow + TensorRT (FP32), GPU Coder + TensorRT (INT8), and TensorFlow + TensorRT (INT8).]
Benchmark setup: Intel® Xeon® CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7.4.1, TensorRT 5.0.2.6; framework: TensorFlow 1.13.0_rc0.
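For context, a hedged sketch of how such an INT8 TensorRT build can be requested from GPU Coder; the entry-point name resnet_predict and the calibration-image folder are placeholders, and calibration options vary by release:

>> cfg = coder.gpuConfig('mex');
>> cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
>> cfg.DeepLearningConfig.DataType = 'int8';              % quantized inference
>> cfg.DeepLearningConfig.DataPath = 'calibrationImages'; % folder of calibration images (placeholder)
>> codegen -config cfg resnet_predict -args {ones(224,224,3,'single')} -report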
22
Single Image Inference on Jetson TX2
[Bar chart: ResNet-50 single-image inference on Jetson TX2, GPU Coder + TensorRT versus TensorFlow + TensorRT.]
Benchmark setup: NVIDIA libraries: CUDA 9, cuDNN 7, TensorRT 3.0.4; framework: TensorFlow 1.12.0.
23
CPU Performance
[Bar chart: CPU single-image inference (Linux), comparing MATLAB, TensorFlow, MXNet, MATLAB Coder, and PyTorch.]
Benchmark setup: Intel® Xeon® CPU 3.6 GHz; frameworks: TensorFlow 1.6.0, MXNet 1.2.1, PyTorch 0.3.1.
24
Brief Summary
DNN libraries are great for inference…
MATLAB Coder and GPU Coder generate code that takes advantage of:
▪ NVIDIA® CUDA libraries, including TensorRT and cuDNN
▪ Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)
▪ ARM® Compute Library for mobile platforms
25
Brief Summary
DNN libraries are great for inference, but applications require more than just inference.
26
Deep Learning Workflows: Integrated Application Deployment
[Diagram: pre-processing + trained DNN + post-processing compiled together by codegen into portable target code (a rough sketch follows).]
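As a rough illustration of this workflow (the function name myDetector, the MAT-file name, the input size, and the choice of cuDNN are illustrative, not taken from the slides), an entry-point function wraps pre-processing and inference, and codegen compiles the whole application:

% Hypothetical entry-point function combining pre-processing and inference.
function out = myDetector(img)  %#codegen
persistent net;
if isempty(net)
    net = coder.loadDeepLearningNetwork('trainedNet.mat');  % illustrative file name
end
in  = imresize(img, [224 224]);    % pre-processing
out = predict(net, in);            % inference; post-processing would follow here
end

Generating code for it from the MATLAB prompt might then look like:

>> cfg = coder.gpuConfig('lib');
>> cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
>> codegen -config cfg myDetector -args {ones(720,1280,3,'uint8')} -report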
27
Lane and Object Detection using YOLO v2
[Diagram: AlexNet-based YOLO v2 with object detection and lane detection stages; post-processing selects the strongest bounding box.]
Workflow:
1) Test in MATLAB on CPU
2) Generate code and test on desktop GPU
3) Generate code and test on Jetson AGX Xavier GPU
28
(1) Test in MATLAB on CPU
[Same diagram: AlexNet-based YOLO v2, object detection, lane detection, post-processing, strongest bounding box.]
29
(2) Generate Code and Test on Desktop GPU
[Same diagram; the network now runs as cuDNN/TensorRT-optimized code and the surrounding logic as CUDA-optimized code.]
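A hedged sketch of step (2): building a CUDA MEX so the generated code can be tested on the desktop GPU from inside MATLAB. The entry point myDetector and testImage reuse the illustrative names from above:

>> cfg = coder.gpuConfig('mex');                                % CUDA MEX for desktop testing
>> cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');  % or coder.DeepLearningConfig('tensorrt')
>> codegen -config cfg myDetector -args {ones(720,1280,3,'uint8')} -report
>> out = myDetector_mex(testImage);                             % call the generated MEX like the original function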
30
(3) Generate Code and Test on Jetson AGX Xavier GPU
[Same diagram; the generated cuDNN/TensorRT and CUDA code now runs on the Jetson AGX Xavier.]
31
Lane and Object Detection using YOLO v2
[Same diagram: AlexNet-based YOLO v2 with cuDNN/TensorRT-optimized network code and CUDA-optimized application code.]
Results:
1) Runs in MATLAB on the CPU
2) Generated code runs 7x faster on the desktop GPU
3) Generated code tested on the Jetson AGX Xavier GPU
32
Accessing Hardware
▪ Deploy standalone application
▪ Access peripherals from MATLAB
▪ Processor-in-the-loop verification
33
Deploy to Target Hardware via Apps and Command Line
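As one hedged command-line sketch (assuming the GPU Coder Support Package for NVIDIA GPUs is installed; the board name, credentials, build directory, and the entry point myDetector are placeholders):

>> hwobj = jetson('jetson-board-name', 'ubuntu', 'ubuntu');   % connect to the board over SSH
>> cfg = coder.gpuConfig('exe');
>> cfg.GenerateExampleMain = 'GenerateCodeAndCompile';        % let the coder supply a test main()
>> cfg.Hardware = coder.hardware('NVIDIA Jetson');
>> cfg.Hardware.BuildDir = '~/remoteBuildDir';
>> cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
>> codegen -config cfg myDetector -args {ones(720,1280,3,'uint8')} -report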
34
[Recap of the single-image inference comparison: PyTorch (1.0.0), MXNet (1.4.0), GPU Coder (R2019a), TensorFlow (1.13.0).]
35
[Same performance chart.]
How do MATLAB Coder and GPU Coder achieve these results?
36
Coders Apply Various Optimizations
[Diagram of the code-generation pipeline:]
▪ Traditional compiler optimizations
▪ Loop optimizations: scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement
▪ CUDA kernel lowering: MATLAB library function mapping, parallel loop creation, CUDA kernel creation (illustrated below), cudaMemcpy minimization, shared memory mapping, CUDA code emission
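As a rough, non-authoritative illustration of what parallel loop creation and CUDA kernel creation operate on, a loop with independent iterations such as the made-up saxpy example below is a typical candidate for being lowered to a CUDA kernel:

% Hypothetical example: each iteration is independent, so GPU Coder can map
% this loop to a CUDA kernel when the function is compiled for a GPU target.
function y = saxpy(a, x, y)  %#codegen
for i = 1:numel(x)
    y(i) = a * x(i) + y(i);
end
end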
37
Coders Apply Various Optimizations
[Same pipeline diagram as the previous slide, now grouped into three performance levers: Optimized Libraries, Network Optimization, Coding Patterns.]
38
Generated Code Calls Optimized Libraries
[Diagram: pre-processing and post-processing call the cuFFT, cuBLAS, cuSolver, and Thrust libraries; the DNN calls the NVIDIA TensorRT & cuDNN libraries, the Intel MKL-DNN library, or the ARM Compute Library (a configuration sketch follows).]
Performance: 1. Optimized Libraries  2. Network Optimizations  3. Coding Patterns
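A hedged sketch of how the deep learning target library is selected in the code-generation configuration (the 'lib' build type is just one option):

>> cfg = coder.gpuConfig('lib');                                   % GPU Coder: NVIDIA targets
>> cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');  % or 'cudnn'
>> ccfg = coder.config('lib');                                     % MATLAB Coder: CPU targets
>> ccfg.DeepLearningConfig = coder.DeepLearningConfig('mkldnn');   % or 'arm-compute'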
39
Deep Learning Network Optimization
[Diagram: layer fusion combines conv, batch norm, ReLU, and add layers of the original network into fused convolution layers, leaving the max-pool layers in place (optimized computation); buffer minimization then reuses intermediate buffers, e.g. buffers a and b, instead of allocating new ones (optimized memory).]
40
Coding Patterns: Stencil Kernels
▪ Automatically applied for image processing functions (e.g. imfilter, imerode, imdilate, conv2, …)
▪ Manually apply using gpucoder.stencilKernel() (a minimal sketch follows)
[Diagram: each output pixel is the dot product of a kh-by-kw convolution kernel with a window of the rows-by-cols input image.]
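A minimal sketch of the manual form; the 3-by-3 mean filter and the function names are illustrative:

% Entry-point function: compute each output pixel from a 3x3 window of A.
function B = meanFilt(A)  %#codegen
B = gpucoder.stencilKernel(@myMean, A, [3 3], 'same');
end

% Called once per window; returns one output element.
function out = myMean(window)
out = mean(window(:));
end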
41
Coding Patterns: Matrix-Matrix Kernels
▪ Automatically applied for many MATLAB functions (e.g. matchFeatures (SAD, SSD), pdist, …)
▪ Manually apply using gpucoder.matrixMatrixKernel() (a minimal sketch follows)
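A minimal sketch of the manual form; here the per-pair function accumulates a dot product over one row of A and one column of B, so the result behaves like a matrix multiply (function names are illustrative):

% Entry-point function: generate a matrix-matrix kernel from dotProd.
function C = myMatMul(A, B)  %#codegen
C = gpucoder.matrixMatrixKernel(@dotProd, A, B);
end

% Applied to one row of A and one column of B at a time.
function c = dotProd(a, b)
c = 0;
for k = 1:numel(a)
    c = c + a(k) * b(k);
end
end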
42
Deep Learning Workflow in MATLAB
[Recap diagram: Deep Neural Network Design + Training (train in MATLAB, model importers, transfer learning, reference models) produces the trained DNN; Application Design wraps it in application logic; Standalone Deployment uses the coders to generate code against the TensorRT and cuDNN libraries, the MKL-DNN library, or the ARM Compute Library.]