Transcript of "Deploying Deep Learning Networks to Embedded GPUs and CPUs" (posted 29-Jun-2020)

1

© 2015 The MathWorks, Inc.

Deploying Deep Learning Networks

to Embedded GPUs and CPUs

Rishu Gupta, PhD

Senior Application Engineer, Computer Vision

2

MATLAB Deep Learning Framework

Access Data → Design + Train → Deploy

▪ Manage large image sets

▪ Automate image labeling

▪ Easy access to models

▪ Acceleration with GPUs

▪ Scale to clusters

3

Multi-Platform Deep Learning Deployment

Embedded / Mobile: NVIDIA TX1, TX2, TK1; Raspberry Pi; BeagleBone

Desktop

Data-center

4

Multi-Platform Deep Learning Deployment

▪ Need code that takes advantage

of:

– NVIDIA® CUDA libraries, including

cuDNN and TensorRT

– Intel® Math Kernel Library for Deep

Neural Networks (MKL-DNN) for

Intel processors

– ARM® Compute libraries for ARM

processors

5

Example targets: Intel Xeon desktop PC, Raspberry Pi board, Android phone, NVIDIA Jetson TX1 board


6

Algorithm Design to Embedded Deployment Workflow

Conventional approach:

1. Desktop GPU: high-level language plus deep learning framework (a large, complex software stack), in C++

2. C/C++: low-level APIs and application-specific libraries

3. Embedded GPU: C/C++ with target-optimized libraries, optimized for memory & speed

Challenges:

• Integrating multiple libraries and packages

• Verifying and maintaining multiple implementations

• Algorithm & vendor lock-in

7

Solution: GPU Coder for Deep Learning Deployment

GPU Coder combines the application logic with the target libraries:

– NVIDIA TensorRT & cuDNN libraries

– ARM Compute Library

– Intel MKL-DNN library

8

Deep Learning Deployment Workflows

INTEGRATED APPLICATION DEPLOYMENT: pre-processing + trained DNN + post-processing → codegen → portable target code

INFERENCE ENGINE DEPLOYMENT: trained DNN → cnncodegen → portable target code

9

Workflow for Inference Engine Deployment

Steps for inference engine deployment:

1. Generate the code for the trained model: >> cnncodegen(net, 'targetlib', 'cudnn')

2. Copy the generated code onto the target board

3. Build the code for the inference engine: >> make -C ./codegen -f …mk

4. Use a hand-written main function to call the inference engine

5. Generate the executable and test it: >> make -C ./ ……
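The same cnncodegen call can retarget other libraries by switching the 'targetlib' value. Only 'cudnn' appears on the slide; the other option names below ('tensorrt', 'arm-compute', 'mkldnn') are taken from GPU Coder's cnncodegen documentation, so treat this as a sketch:

```matlab
% Sketch: retargeting the generated inference engine via 'targetlib'.
% 'cudnn' is shown on the slide; the other option names follow the
% cnncodegen documentation.
net = alexnet;                                % any trained network works here

cnncodegen(net, 'targetlib', 'cudnn');        % NVIDIA GPU via cuDNN
cnncodegen(net, 'targetlib', 'tensorrt');     % NVIDIA GPU via TensorRT
cnncodegen(net, 'targetlib', 'arm-compute');  % ARM CPU (NEON) via ARM Compute Library
cnncodegen(net, 'targetlib', 'mkldnn');       % Intel CPU via MKL-DNN
```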

INFERENCE ENGINE DEPLOYMENT: trained DNN → cnncodegen → portable target code

10

How to get a Trained DNN into MATLAB?

Three ways to get a trained DNN: train in MATLAB, apply transfer learning to a reference model, or bring one in through the model importer.

11

Deep Learning Inference Deployment

Train in MATLAB, apply transfer learning to a reference model, or use the model importer; the trained DNN is then deployed against the target libraries: NVIDIA TensorRT & cuDNN libraries, ARM Compute Library, Intel MKL-DNN library.

12

Building DNN from Scratch

Load Training Data → Build Layer Architecture → Set Training Options → Train Network

%% Create a datastore
imds = imageDatastore('Data',...
    'IncludeSubfolders',true,'LabelSource','foldernames');
num_classes = numel(unique(imds.Labels));

%% Build layer architecture
layers = [imageInputLayer([64 32 3])
          convolution2dLayer(5,20)
          reluLayer()
          maxPooling2dLayer(2,'Stride',2)
          fullyConnectedLayer(512)
          fullyConnectedLayer(2)
          softmaxLayer()
          classificationLayer()];

%% Set Training Options
miniBatchSize = 64;   % not defined on the slide; value chosen for illustration
trainOpts = trainingOptions('sgdm',...
    'MiniBatchSize', miniBatchSize,...
    'Plots', 'training-progress');

%% Train Network
net = trainNetwork(imds, layers, trainOpts);
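Once trainNetwork returns, the network can be exercised directly in MATLAB before any code generation. A minimal sanity check, where 'testImage.png' is a placeholder file name:

```matlab
% Sketch: sanity-check the freshly trained network in MATLAB.
% 'testImage.png' is a hypothetical file; the image must match the
% 64x32x3 input layer defined above.
img = imread('testImage.png');
img = imresize(img, [64 32]);          % match imageInputLayer([64 32 3])
[label, scores] = classify(net, img);
fprintf('Predicted class: %s (%.2f)\n', string(label), max(scores));
```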

13

Pedestrian Detection DNN Deployment on ARM Processor

layers = [imageInputLayer([64 32 3])
          convolution2dLayer(5,20)
          reluLayer()
          maxPooling2dLayer(2,'Stride',2)
          crossChannelNormalizationLayer(5,'K',1)
          convolution2dLayer(5,20)
          reluLayer()
          maxPooling2dLayer(2,'Stride',2)
          fullyConnectedLayer(512)
          fullyConnectedLayer(2)
          softmaxLayer()
          classificationLayer()];

14

Pedestrian Detection DNN Deployment on ARM Processor

▪ ARM Neon instruction set architecture

– Example: ARM Cortex-A

▪ ARM Compute Library

– Low-level software functions

– Computer vision, machine learning, etc.

▪ Pedestrian detection on Raspberry Pi

15

Deep Learning Inference Deployment

Train in MATLAB, apply transfer learning to a reference model, or use the model importer; deploy the trained DNN against the target libraries: NVIDIA TensorRT & cuDNN libraries, ARM Compute Library, Intel MKL-DNN library.

16

Importing DNN from Open Source Framework

Caffe Model Importer (including Caffe Model Zoo)

▪ importCaffeLayers

▪ importCaffeNetwork

TensorFlow-Keras Model Importer

▪ importKerasLayers

▪ importKerasNetwork

network = importCaffeNetwork(protofile, 'yolo.caffemodel');
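A TensorFlow-Keras model imports the same way. A sketch, where 'model.h5' is a hypothetical file name:

```matlab
% Sketch: importing a Keras model saved as HDF5. 'model.h5' is a
% placeholder file name. importKerasLayers is the useful entry point
% when the model contains layers that need inspection or replacement
% before assembling a network.
layers = importKerasLayers('model.h5', 'ImportWeights', true);
net    = importKerasNetwork('model.h5');   % one-step import when all layers are supported
```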

17

Deep Learning Inference Deployment

Train in MATLAB, apply transfer learning to a reference model, or use the model importer; deploy the trained DNN against the target libraries: NVIDIA TensorRT & cuDNN libraries, ARM Compute Library, Intel MKL-DNN library.

Object Detection


20

Layered Architecture for SegNet: Semantic Segmentation

DAG network; total number of layers: 91

21

NVIDIA TensorRT: programmable inference accelerator

Deployment targets: Tesla V100, Drive PX 2, Tesla P4, Jetson TX2, NVIDIA DLA

22

Performance Summary (VGG-16) on Titan Xp

[Bar chart comparing inference performance of MATLAB (cuDNN fp32), GPU Coder (cuDNN fp32), GPU Coder (TensorRT fp32), and GPU Coder (TensorRT int8)]

23

How Good is Generated Code Performance?

▪ Performance of CNN inference (AlexNet) on a Titan Xp GPU

▪ Performance of CNN inference (AlexNet) on a Jetson (Tegra) TX2

24

AlexNet Inference on NVIDIA Titan Xp

[Chart: frames per second vs. batch size for GPU Coder + TensorRT (3.0.1), GPU Coder + TensorRT (3.0.1, int8), GPU Coder + cuDNN, MXNet (1.1.0), and TensorFlow (1.6.0). Test platform: Intel Xeon E5-1650 v4 CPU @ 3.60 GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7]

25

VGG-16 Inference on NVIDIA Titan Xp

[Chart: frames per second vs. batch size for GPU Coder + TensorRT (3.0.1), GPU Coder + TensorRT (3.0.1, int8), GPU Coder + cuDNN, MXNet (1.1.0), and TensorFlow (1.6.0). Test platform: Intel Xeon E5-1650 v4 CPU @ 3.60 GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7]

26

AlexNet Inference on Jetson TX2: Frame-Rate Performance

[Chart: frames per second vs. batch size for GPU Coder + TensorRT, GPU Coder + cuDNN, and C++ Caffe (1.0.0-rc5)]

27

Brief Summary

DNN libraries are great for inference…

▪ GPU Coder generates code that takes advantage of:

– NVIDIA® CUDA libraries, including cuDNN and TensorRT

– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)

– ARM® Compute Library for mobile platforms

28

Brief Summary

DNN libraries are great for inference…

…but applications require more than just inference.

29

Deep Learning Workflows: Integrated Application Deployment

INTEGRATED APPLICATION DEPLOYMENT: pre-processing + trained DNN + post-processing → codegen → portable target code

30

Traffic sign detection and recognition

Object detection DNN (YOLO) → strongest bounding box → classifier DNN (recognition net)
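The two-stage pipeline can be sketched in MATLAB as follows. detectionNet, recognitionNet, runYoloDetector, and pickStrongestBbox are hypothetical names standing in for the YOLO detector, the recognition net, and their application-specific pre/post-processing; the patch size is likewise an assumption:

```matlab
% Sketch of the detection -> strongest-box -> classification pipeline.
% detectionNet / recognitionNet are assumed to be already-loaded DNNs;
% runYoloDetector and pickStrongestBbox are stand-ins for the
% application's own pre- and post-processing code.
function label = recognizeTrafficSign(img, detectionNet, recognitionNet)
    bboxes = runYoloDetector(detectionNet, img);    % object detection DNN
    bbox   = pickStrongestBbox(bboxes);             % keep strongest bounding box
    patch  = imresize(imcrop(img, bbox), [32 32]);  % crop size is an assumption
    label  = classify(recognitionNet, patch);       % classifier DNN
end
```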

31

Traffic sign detection and recognition

32

Traffic sign detection and recognition

33

GPU Coder Helps You Deploy Applications to GPUs Faster

GPU Coder performs:

• CUDA kernel creation: library function mapping, loop optimizations, dependence analysis, data locality analysis

• Memory allocation: GPU memory allocation, data-dependence analysis

• Data transfer minimization: dynamic memcpy reduction

CUDA Code Generation from the GPU Coder App

Integrated editor and simplified workflow for code generation

35

Summary: GPU Coder

Start from the MATLAB algorithm (functional reference), then:

1. Functional test: run the MATLAB algorithm itself

2. Deployment unit-test: desktop GPU, build type .mex, call CUDA from MATLAB directly

3. Deployment integration-test: desktop GPU (C++), build type .lib, call CUDA from a (C++) hand-coded main()

4. Real-time test: embedded GPU, build type cross-compiled .lib, call CUDA from a (C++) hand-coded main()
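In the codegen (as opposed to cnncodegen) workflow, the build types above map onto coder.gpuConfig settings. A sketch, where myDetector, its input prototype, and the Jetson hardware name are placeholders (the Jetson line assumes the GPU Coder hardware support package is installed):

```matlab
% Sketch: the build types from the table, driven through codegen.
% 'myDetector' and the input prototype are hypothetical.
in = coder.typeof(uint8(0), [64 32 3]);   % example input prototype

cfg = coder.gpuConfig('mex');             % step 2: call CUDA from MATLAB directly
codegen -config cfg myDetector -args {in}

cfg = coder.gpuConfig('lib');             % step 3: link into a hand-coded C++ main()
codegen -config cfg myDetector -args {in}

cfg = coder.gpuConfig('lib');             % step 4: cross-compile for the embedded target
cfg.Hardware = coder.hardware('NVIDIA Jetson');   % support package assumed
codegen -config cfg myDetector -args {in}
```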

36

MATLAB Deep Learning Framework

Access Data → Design + Train → Deploy

▪ Manage large image sets

▪ Automate image labeling

▪ Easy access to models

▪ Acceleration with GPUs

▪ Scale to clusters

DEPLOYMENT: NVIDIA TensorRT & cuDNN libraries, Intel MKL-DNN library, ARM Compute Library

37

• Share your experience with MATLAB & Simulink on social media

▪ Use #MATLABEXPO

▪ Template: I use #MATLAB because ……………………… Attending #MATLABEXPO

▪ Examples:

▪ I use #MATLAB because it helps me be a data scientist! Attending #MATLABEXPO

▪ Learning new capabilities in #MATLAB and #Simulink at #MATLABEXPO.

• Share your session feedback: please fill in the feedback form for this session

Speaker Details

Email: Rishu.Gupta@mathworks.in

LinkedIn: https://www.linkedin.com/in/rishu-gupta-72148914/

Contact MathWorks India

Products/Training Enquiry Booth

Call: 080-6632-5749

Email: info@mathworks.in