Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas,...
Transcript of Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas,...
![Page 1: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/1.jpg)
1© 2019 The MathWorks, Inc.
Deploying AI on Jetson Xavier/DRIVE Xavier with
TensorRT and MATLAB
Jaya Shankar, Engineering Manager (Deep Learning Code Generation)
Avinash Nehemiah, Principal Product Manager ( Computer Vision, Deep Learning, Automated Driving)
![Page 2: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/2.jpg)
2
Outline
Ground Truth Labeling Network Design and Training
CUDA and TensorRT Code Generation
Jetson Xavier and DRIVE Xavier Targeting
Key TakeawaysOptimized CUDA and TensorRT code generation Jetson Xavier and DRIVE Xavier targetingProcessor-in-loop(PIL) testing and system integration
Key TakeawaysPlatform Productivity: Workflow automation, ease of use Framework Interoperability: ONNX, Keras-TensorFlow, Caffe
![Page 3: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/3.jpg)
3
Example Used in Today’s Talk
Lane Detection Network
Co-ordinate Transform
Bounding Box Processing
AI Application
YOLOv2 Network
![Page 4: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/4.jpg)
4
Outline
Ground Truth Labeling Network Design and Training
CUDA and TensorRT Code Generation
Jetson Xavier and DRIVE Xavier Targeting
![Page 5: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/5.jpg)
5
Ground Truth LabelingUnlabeled Training Data Labels for Training
![Page 6: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/6.jpg)
6
Interactive Tools for Ground Truth Labeling
ROI Labels• Bound boxes• Pixel labels • Poly-lines
Scene Labels
![Page 7: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/7.jpg)
7
Automate Ground Truth Labeling
Pre-built Automation
User authored automation
![Page 8: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/8.jpg)
8
Automating Labeling of Lane Markers
![Page 9: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/9.jpg)
9
Automate Labeling of Bounding Boxes for Vehicles
![Page 10: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/10.jpg)
10
Export Labeled Data for Training
Bounding Boxes Labels
Polyline Labels
![Page 11: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/11.jpg)
11
Outline
Ground Truth Labeling Network Design and Training
CUDA and TensorRT Code Generation
Jetson Xavier and DRIVE Xavier Targeting
![Page 12: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/12.jpg)
12
Example Used in Today’s Talk
Lane Detection Network
Co-ordinate Transform
Bounding Box Processing
AI Application
YOLOv2 Network
![Page 13: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/13.jpg)
13
Lane Detection Algorithm
Pretrained Network (E.g. AlexNet)
Modify Network for
Lane DetectionCoefficients of parabola
Transform to
Image Coordinates
![Page 14: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/14.jpg)
14
Lane Detection: Load Pretrained Network
Lane Detection Network▪ Regression CNN for lane parameters ▪ MATLAB code to transform to image co-ordinates
>> net = alexnet
>> deepNetworkDesigner
![Page 15: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/15.jpg)
15
View Network in Deep Network Designer App
![Page 16: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/16.jpg)
16
Remove Layers from AlexNet
![Page 17: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/17.jpg)
17
Add Regression Output for Lane Parameters
Regression Output for
Lane Coefficients
![Page 18: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/18.jpg)
18
Transparently Scale Compute for Training
Specify Training on:
Quickly change training hardware
Works on Windows
(no additional setup)
![Page 19: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/19.jpg)
19
NVIDIA NGC & DGX Supports MATLAB for Deep Learning
▪ GPU-accelerated MATLAB Docker container for deep learning
– Leverage multiple GPUs on NVIDIA DGX Systems and in the Cloud
▪ Cloud providers include: AWS, Azure, Google, Oracle, and Alibaba
▪ NVIDIA DGX System / Station
– Interconnects 4/8/16 Volta GPUs in one box
▪ Containers available for R2018a and R2018b– New Docker container with every major release (a/b)
▪ Download MATLAB container from NGC Registry– https://ngc.nvidia.com/registry/partners-matlab
S9469 - MATLAB and NVIDIA Docker: A Complete AI Solution, Where You Need It, in an InstantWednesday, Mar 20, 4:00 PM - 04:50 PM
![Page 20: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/20.jpg)
20
Evaluate Lane Boundary Detections vs. Ground Truth
evaluateLaneBoundaries
![Page 21: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/21.jpg)
21
Example Used in Today’s Talk
Lane Detection Network
Co-ordinate Transform
Bounding Box Processing
AI Application
YOLOv2 Network
![Page 22: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/22.jpg)
22
YOLO v2 Object Detection
Pretrained Network Feature Extractor ( E.g. ResNet 50)
Detection SubnetworkYOLO CNN Network Decode Predictions
![Page 23: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/23.jpg)
23
Model Exchange with MATLAB
PyTorch
Caffe2
MXNet
Core ML
CNTK
Keras-
Tensorflow
Caffe
MATLABONNX
Open Neural Network Exchange
![Page 24: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/24.jpg)
24
Import Pretrained Network in ONNX Format
![Page 25: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/25.jpg)
25
Modify Network
imageInputLayer replaces the input and subtraction layer
Removing the 2 ResNet-50 layers
![Page 26: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/26.jpg)
26
YOLOv2 Detection Network
▪ yolov2Layers: Create network architecture
>> lgraph = yolov2Layers(imageSize, numClasses, anchorBoxes, network, featureLayer)
Number of Classes
Pretrained Feature Extractor
>> detector = trainYOLOv2ObjectDetector(trainingData,lgraph,options)
![Page 27: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/27.jpg)
27
Evaluate Performance of Trained Network
▪ Set of functions to evaluate trained
network performance
– evaluateDetectionMissRate
– evaluateDetectionPrecision
– bboxPrecisionRecall
– bboxOverlapRatio
>> [ap,recall,precision] =
evaluateDetectionPrecision(results,vehicles(:,2));
evaluateDetectionPrecision
bboxPrecisionRecall
![Page 28: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/28.jpg)
28
Example Applications using MATLAB for AI Development
Lane Keeping Assist using
Reinforcement LearningOccupancy Grid Creation
using Deep Learning
Lidar Segmentation with
Deep Learning
![Page 29: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/29.jpg)
29
Outline
Ground Truth Labeling Network Design and Training
CUDA and TensorRT Code Generation
Jetson Xavier and DRIVE Xavier Targeting
Key TakeawaysPlatform Productivity: Workflow automation, ease of use Framework Interoperability: ONNX, Keras-TensorFlow, Caffe
![Page 30: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/30.jpg)
30
CUDA code generation with GPU Coder
Ease of programming(expressivity, algorithmic, …)
Pe
rfo
rma
nce
easier
faster
MATLAB
Python
CUDA
OpenCL
C/C++
GPUs are “hardware on steroids”, … but, programming them is hard
GPU Coder
![Page 31: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/31.jpg)
31
Consider an example: saxpy
Simple MATLAB Loop
Automatic compilation from an expressive language to a high-performance language
![Page 32: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/32.jpg)
32
Deep Learning code generation with GPU Coder
TensorRT is great for inference, … but, applications require more than inference
TensorRT
Deep learning
![Page 33: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/33.jpg)
33
Regression DNN for lane
coefficients
Post process and
transform to image
coordinates
Strongest bounding
box
DNN Application
YOLOv2 object
detection DNN
Generate CUDA/C++/TensorRT code
with GPU Coder
Generate code for your application with GPU Coder
![Page 34: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/34.jpg)
34
Regression DNN for lane
coefficients
Post process and
transform to image
coordinates
Strongest bounding
box
DNN Application
YOLOv2 object
detection DNN
Generate Optimized CUDA code
![Page 35: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/35.jpg)
35
GPU Coder automatically extracts parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB(math operators and library functions)
3. Composite functions in MATLAB(maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT)
Infer CUDA
kernels from
MATLAB loops
Library
replacement
![Page 36: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/36.jpg)
36
GPU Coder runs a host of compiler transforms to generate CUDA
Control-flow graph
Intermediate representation
(CFG – IR)
….
….
CUDA kernel
lowering
Front – end
Traditional compiler
optimizations
MATLAB Library function mapping
Parallel loop creation
CUDA kernel creation
cudaMemcpy minimization
Shared memory mapping
CUDA code emission
Scalarization
Loop perfectization
Loop interchange
Loop fusion
Scalar replacement
Loop
optimizations
![Page 37: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/37.jpg)
37
static __global__ mykernel(A, X, Y, C, n)
{
int k = getThreadIndex(N);
int t = A[k] * X[k];
C[k] = t + Y[k];
}
From a loop to a CUDA kernel
for k = 1:n
t = A(k) .* X(k);
C(k) = t + Y(k);
end
{ …
mykernel<<< f(n) >>>(A, X, Y, C, n);
…
}
Create kernel from
loop bodyCompute kernel size
Y
f(n)
Classify kernel variables
(input, output, local)
Ins: A, X, Y, n
Outs: C
Local: t, k
Is this loop
parallel?
Dependence analysis
to understand the
iteration space
Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB
![Page 38: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/38.jpg)
38
From a loop to a CUDA kernel
for i = 1:p
for j = 1:m
for k = 1:n
…(inner loop)…
end
end
end
for i = 1:p
…(outer prologue code)…
for j = 1:m
for k = 1:n
…(inner loop)…
end
…(outer epilogue code)…
end
end
Perfect LoopsImperfect Loops
Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB
![Page 39: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/39.jpg)
39
From a loop to a CUDA kernel
for i = 1:p
for j = 1:m
for k = 1:n
… (outer prologue code) …
…(inner loop)…
if k == n
… (outer epilogue code) …
end
end
end
end
for i = 1:p
…(outer prologue code)…
for j = 1:m
for k = 1:n
…(inner loop)…
end
…(outer epilogue code)…
end
end
Perfect LoopsImperfect Loops
Loop Perfectization
Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB
![Page 40: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/40.jpg)
40
From vectorized MATLAB to CUDA kernels
output(:, 1) = (input(:, 1) – x_im) .* factor;
Loop Fusion
Create larger parallel
loops (and hence
CUDA kernels)
Scalarization
Reduce to
for-loops
for i = 1:M
diff(i) = input(i, 1) – x_im(i);
end
for a = 1:M
output(i, 1) = diff(i) * factor(i);
end
for i = 1:M
diff(i) = input(i, 1) – x_im(i);
output(i, 1) = diff(i) * factor(i);
end
Assume the following sizes:
‘output’ : M x 3
‘input’ : M x 3
‘x_im’ : M x 1
‘factor’ : M x 1
Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB
![Page 41: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/41.jpg)
41
From vectorized MATLAB to CUDA kernels
output(:, 1) = (input(:, 1) – x_im) .* factor;
Loop Fusion
Create larger parallel
loops (and hence
CUDA kernels)
Scalar
Replacement
Reduce temp matrices
to temp scalars
Scalarization
Reduce to
for-loops
for i = 1:M
diff(i) = input(i, 1) – x_im(i);
end
for a = 1:M
output(i, 1) = diff(i) * factor(i);
end
for i = 1:M
diff(i) = input(i, 1) – x_im(i);
output(i, 1) = diff(i) * factor(i);
end
for i = 1:M
tmp = input(i, 1) – x_im(i);
output(i, 1) = tmp * factor(i);
end
Assume the following sizes:
‘output’ : M x 3
‘input’ : M x 3
‘x_im’ : M x 1
‘factor’ : M x 1
Extracting parallelism from loops
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB
![Page 42: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/42.jpg)
42
Optimizing CPU-GPU data movement is a challenge
A = …
…
for i = 1:N
… A(i)
end
…
imfilter
…
A = …
…
kernel1<<<…>>>( )
…
imfilter_kernel(…)
…
Where is the ideal placement of memcpy?
Naïve placement
K1
K2
K3
Optimized placement
K1
K2
K3
A = …
…
cudaMemcpyHtoD(gA, a);
kernel1<<<…>>>(gA)
cudaMemcpyDtoH(…)
…
cudaMemcpyHtoD(…)
imfilter_kernel(…)
cudaMemcpyDtoH(…)
…
Memory Optimizations
![Page 43: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/43.jpg)
43
GPU Coder optimizes memcpy placement
A(:) = ….
C(:) = ….
for i = 1:N
….
gB = kernel1(gA);
gA = kernel2(gB);
if (some_condition)
gC = kernel3(gA, gB);
end
….
end
…. = C;
cudaMemcpy
*definitely* needed
cudaMemcpy
*not* needed
cudaMemcpy
*may be* needed
Observations
• Equivalent to Partial redundancy elimination (PRE)
• Dynamic strategy – track memory location with a
status flag per variable
• Use-Def to determine where to insert memcpy
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
if (A_isDirtyOnCpu)
cudaMemcpy(gA, A);
A_isDirtyOnCpu = false;
end
gB = kernel1(gA);
gA = kernel2(gB);
if (somecondition)
gC = kernel3(gA, gB);
C_isDirtyOnGpu = true;
end
…
end
…
if (C_isDirtyOnGpu)
cudaMemcpy(C, gC);
C_isDirtyOnGpu = false;
end
… = C;
Assume gA, gB and gC are mapped to GPU memory
Generated (pseudo) code
![Page 44: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/44.jpg)
44
From composite functions to optimized CUDA
• imfilter
• imresize
• imerode
• imdilate
• bwlabel
• imwarp
• …
• Deep learning inference
(cuDNN, TensorRT)
• …
• Matrix multiply (cuBLAS)
• Linear algebra (cuSolver)
• FFT functions (cuFFT)
• Convolution
• …
Core math Image processing
Computer visionNeural Networks
Over 300+ MATLAB functions are optimized for CUDA code generation
![Page 45: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/45.jpg)
45
Dotprod
Input image Conv. kernel Output image
rows
co
ls
kw
kh
GPU Coder automatically maps data to shared memory
gpucoder.stencilKernel() automatically generates shared memory algorithm
Used when generating code from any image processing functions in MATLABimfilter, imerode, imdilate, conv2, …
![Page 46: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/46.jpg)
46
Generates optimal matrix-matrix operations
gpucoder.matrixMatrixKernel automatically generates shared memory algorithm
Used when generating code from many standard MATLAB functions matchFeatures SAD, SSD, pdist, …
![Page 47: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/47.jpg)
47
Generate CUDA friendly algorithms from MATLAB functions
Reductions Transpose Histogram
Sort Min/Max CumSum
![Page 48: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/48.jpg)
48
Integrating with legacy CUDA code
out = coder.ceval(‘-gpudevicefcn,
‘foo’, % your function name
a, % arguments
b);
__device__ float foo(float a, float b);
Calling a GPU device function
function out = foo(in)
……
……end
Caller needs to be a device
function
Arguments allocated in
device memory
![Page 49: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/49.jpg)
49
function out = foo(in)……
……end
__global__
void foo(const float *a[100], const float *b[100], float *c[100] );
Integrating with legacy CUDA code
out = coder.ceval( ‘foo<1024>’, % kernel call with launch params
coder.rref(a, ‘-gpu’), % value is read on GPU
coder.rref(b, ‘-gpu’),
coder.wref(c, ‘-gpu’)); % value is written on GPU
Calling a kernel function
Arguments allocated or copied
to device memory as required.
Caller may be a host or
device function
![Page 50: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/50.jpg)
50
Compiled image processing applications
Fog Rectification
5x speedup
Feature Matching
15x speedup
Stereo Disparity
8x speedup
![Page 51: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/51.jpg)
51
Regression DNN for lane
coefficients
Post process and
transform to image
coordinates
Strongest bounding
box
DNN Application
YOLOv2 object
detection DNN
Generate Optimized CUDA code Generate optimized
cuDNN/TensorRT inference code
![Page 52: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/52.jpg)
52
Regression DNN for lane
coefficients
Post process and
transform to image
coordinates
Strongest bounding
box
DNN Application
Object detection with
YOLOv2 DNN
Generate optimized cuDNN/TensorRT inference code
Layer Fusion
DNN optimizations
Memory
optimization
Network re-
architecting
![Page 53: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/53.jpg)
53Network
Deep learning network optimizations
conv
Batch
Norm
ReLu
Add
conv
ReLu
Max
Pool
Max
Pool
Layer fusion
Optimized computation
FusedConv
FusedConv
BatchNormAdd
Max
Pool
Max
Pool
Buffer minimization
Optimized memory
FusedConv
FusedConv
BatchNormAdd
Max
Pool
buffer a
buffer b
buffer d
Max
Pool
buffer c
buffer e
X Reuse buffer a
X Reuse buffer b
![Page 54: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/54.jpg)
54
Code generation workflow
Deployment
unit-test
Desktop
GPU
.mex
Build type
Call compiled
application from
MATLAB directly
![Page 55: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/55.jpg)
55
Code generation workflow
Deployment
unit-test
Desktop
GPU
.mex
Build type
Call compiled
application from
MATLAB directly
![Page 56: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/56.jpg)
56
Semantic Segmentation Defect classification and detection
Blood smear segmentation
Live Demos
at our GTC
booth #1343
![Page 57: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/57.jpg)
57
Single Image Inference on Titan V using cuDNN
PyTorch (1.0.0)
MXNet (1.4.0)
GPU Coder (R2019a)
TensorFlow (1.13.0)
Intel® Xeon® CPU 3.6 GHz - NVIDIA libraries: CUDA10 - cuDNN 7 - Frameworks: TensorFlow 1.13.0, MXNet 1.4.0 PyTorch 1.0.0
![Page 58: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/58.jpg)
58
MATLAB GPU Coder
TensorRT Accelerates Inference Performance on Titan V
TensorFlow
( 7.4 ) ( 5.0 )
![Page 59: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/59.jpg)
59
GPU Coder
TensorRT Accelerates Inference Performance on Titan V
TensorFlow
Intel® Xeon® CPU 3.6 GHz - NVIDIA libraries: CUDA10 - cuDNN 7 - Frameworks: TensorFlow 1.13.0
Single Image Inference with ResNet-50 (Titan V)
cuDNN TensorRT (FP32) TensorRT (INT8)
![Page 60: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/60.jpg)
60
Outline
Ground Truth Labeling Network Design and Training
CUDA and TensorRT Code Generation
Jetson Xavier and DRIVE Xavier Targeting
![Page 61: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/61.jpg)
61
Deploy to Jetson and Drive
MATLAB algorithm
(functional reference)
Functional test1 Deployment
unit-test
2
Desktop
GPU
.mex
Build type
Call compiled
application from
MATLAB directly
C++
Deployment
integration-test
3
Desktop
GPU
.lib
Call compiled
application
from hand-coded
main()
Real-time test4
Embedded GPU
Deploy to target
Deploy to target
and run with
hardware-in-loop
![Page 62: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/62.jpg)
62
Streamlined deployment to Jetson or Drive GPUs.
Host Machine
Deploy standalone application
![Page 63: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/63.jpg)
63
Closed loop testing and verification on embedded GPUs.
Host Machine
Run application in MATLAB with Hardware in loop
![Page 64: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/64.jpg)
64
Deploying DNN application to Jetson/Drive device
Target Display
Host Machine
Video Feed
Deployable MATLAB
algorithm code
![Page 65: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/65.jpg)
65
Access to target peripherals from MATLAB
MATLAB talks to peripherals on hardware
Verification with processor-in-loop mode
Streams from
target peripherals
to MATLAB
![Page 66: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/66.jpg)
66
Generate code for peripheral access
Generate code to access peripherals on board
Standalone deployment mode
Generates interface
code for target
peripherals
![Page 67: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/67.jpg)
67
Processor-in-loop verificationwith MATLAB
GPU Coder enables hardware prototyping and system integration
Deploy and run on NVIDIA target
Jetson/DRIVE platformMATLAB on host
Peripheral access support
![Page 68: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/68.jpg)
68
In Summary
Ground Truth Labeling Network Design and Training
CUDA and TensorRT Code Generation
Jetson Xavier and DRIVE Xavier Targeting
DeploymentOptimized CUDA and TensorRT code generation Jetson Xavier and DRIVE Xavier targetingProcessor-in-loop(PIL) testing and system integration
Network designPlatform Productivity: Workflow automation, ease of use Framework Interoperability: ONNX, Keras-TensorFlow, Caffe
![Page 69: Deploying AI on Jetson Xavier/DRIVE Xavier with …...Composite functions in MATLAB (maps to cuBlas, cuFFT, cuSolver, cuDNN, TensorRT) Infer CUDA kernels from MATLAB loops Library](https://reader030.fdocuments.us/reader030/viewer/2022040907/5e7d878a4e82520c1f33a9f3/html5/thumbnails/69.jpg)
69
Thank You