DeepStream SDK: Towards Large Scale Deployment of...

November 1, 2017

DeepStream SDK:Towards Large Scale Deployment of Intelligent Video Analytics Systems

Jeremy Furtek

2

MOTIVATION

3

Proliferation of Video Data

Content Filtering

Retail Analytics

Traffic Engineering

Law Enforcement

Ad Injection

Securing Critical Infrastructure

Parking Management

In-Vehicle Analytics

4

Advances in Deep Learning: Classification and Beyond

Image Classification(AlexNet, 2012)

Object Detection(R-FCN, 2016)

Semantic Image Segmentation(ENet, 2016)

Image Compression(WaveOne, 2017)

Structure From Motion(SfM-Net, 2017)

Pose-Aware Face Recognition(USC, 2017)

5

WHAT IS DEEPSTREAM?

HOW IT WORKS

DEEP LEARNING

HOW IT WORKS

VIDEO

DEEP LEARNING WITH DEEPSTREAM

DeepStream EnabledApp or Service

VIDEO ANALYTICS WITH DEEP LEARNING ON NVIDIA GPUS: AN OVERVIEW

9

From Video to Analytics With Deep Learning…

compressed video

bitstream

decoded videoframe

(YCbCr image)

Video

Decode

1 0 1 1 0…

10


compressed video

bitstream

decoded videoframe

(YCbCr image)

Video

Decode

1 0 1 1 0…

Video Decode

Fully-accelerated hardware decoding on NVIDIAGPUs (NVDEC) is accessible via the Video DecodeAPI.

11


compressed video

bitstream

decoded videoframe

BGRimage

(YCbCr image)

Image

Conversion

Video

Decode

1 0 1 1 0…

12


compressed video

bitstream

decoded videoframe

BGRimage

(YCbCr image)

Image

Conversion

Video

Decode

1 0 1 1 0…

Image Conversion

Custom CUDA kernels and/or optimized libraries (e.g. NVIDIA Performance Primitives, or NPP) enable high performance image conversions.

__global__ void resize(uint8_t* dst,

const uint8_t* src,

int dst_w,

int dst_h,

int src_w,

int src_h)

{

…

}

13


compressed video

bitstream

decoded videoframe

BGRimage

(YCbCr image)

Image

Conversion

Video

Decode

1 0 1 1 0…

inferenceresults

cat

dog

tree

0.9

0.05

0.05

Inference

14


compressed video

bitstream

decoded videoframe

BGRimage

(YCbCr image)

Image

Conversion

Video

Decode

1 0 1 1 0…

inferenceresults

cat

dog

tree

0.9

0.05

0.05

Inference

InferenceTensorRT* provides a neural network optimizer and a high performance runtime library for inference deployment on NVIDIA GPUs.

Step 1: Optimize a trained neural network Step 2: Perform real-time inference with the TensorRT runtime

*The current DeepStream release supports TensorRT 2.1.

15


compressed video

bitstream

decoded videoframe

BGRimage

inferenceresults

(YCbCr image)

cat

dog

tree

0.9

0.05

0.05

Image

ConversionInference

Display /

StorageVideo

Decode

1 0 1 1 0…

16


compressed video

bitstream

decoded videoframe

BGRimage

inferenceresults

(YCbCr image)

cat

dog

tree

0.9

0.05

0.05

Image

ConversionInference

Display /

StorageVideo

Decode

1 0 1 1 0…

Display/StorageProcessing of inference output is application dependent.

Some common options are:

• Real-time display with overlay• Alert generation• Database insertion• Augmented video encoding• “Slew-to-Cue” camera control• Tracker update• Custom post-processing algorithms (CUDA, OpenCV, …)

17

Video to Analytics With the DeepStream SDK

DeepStream runtime library (libdeepstream)

compressed video

bitstream

decoded videoframe

BGRimage

inferenceresults

(YCbCr image)

cat

dog

tree

0.9

0.05

0.05

Image

ConversionInference

Display /

StorageVideo

Decode

1 0 1 1 0…

18

USING THE DEEPSTREAM LIBRARY

19

DeepStream from the Ground Up

DeepStream tensors are objects that implement the IStreamTensor interface and represent a 4-dimensional array of values stored in NCHW order.

Tensors

compressed video

bitstream

W

H

C

N

typedef enum {

FLOAT_TENSOR, // float32

NV12_FRAME, // decoded YCbCr

OBJ_COORD, // detection bounding box

CUSTOMER_TYPE // user defined

} TENSOR_TYPE;

typedef enum {

GPU_DATA, // device memory

CPU_DATA, // host memory

} MEMORY_TYPE;

IStreamTensor* createStreamTensor(int maxlen,

size_t elemsize,

TENSOR_TYPE Ttype,

MEMORY_TYPE Mtype,

int deviceID);

20


DeepStream modules are objects that implement the IModule interface and operate on one or more input tensors to produce one or more output tensors.

Modules

void IModule::execute(const ModuleContext& ctx,

const std::vector<IStreamTensor*>& in,

const std::vector<IStreamTensor*>& out) = 0;

DeepStream

Module

input[0]

input[1]

input[M -1]

…

output[0]

output[1]

output[N -1]

…

21


DeepStream modules maintain connectivity information using a std::vector of <IModule*, int> pairs that represent the inputs to the module.

Building the DeepStream dataflow graph

Module CModule A A.output[0]

A.output[1]

Module B

output[0]

B.output[0]

B.output[1]

C.input[0]

C.input[1]

IModule* index

&A 1

&B 0

Inputs:

22

Using the DeepStream Library

// Create a device worker

IDeviceWorker* pDW = createDeviceWorker(1, // num channels

0); // GPU device ID

Step 1: Create a device worker object

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

23



IDeviceWorker* pDW = createDeviceWorker(1, 0);

// Add a decode task, specifying the CODEC type

pDW->addDecodeTask(cudaVideoCodec_H264);

Step 2: Add a decode task

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

compressed video

bitstream

decoded videoframe

(YCbCr image)

Video

Decode

1 0 1 1 0…

24



IDeviceWorker* pDW = createDeviceWorker(1, 0);

// Add a decode task, specifying the CODEC type

pDW->addDecodeTask(cudaVideoCodec_H264);

// Add a module for color conversion.

// Color conversion module outputs:

// 0: BGR_PLANAR

// 1: NV12 (YCbCr)

IModule* pCC = pDW->addColorSpaceConvertorTask(BGR_PLANAR);

Step 3: Add an image conversion module

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

compressed video

bitstream

decoded videoframe

BGRimage

(YCbCr image)

Image

Conversion

Video

Decode

1 0 1 1 0…

Caffe uses BGR ordering(not RGB)

25

Inference in DeepStream: CAFFE and TensorRT

solver

parameters

network

description

(training)

CAFFE

caffe train –solver solver.prototxt

trained

weightssolver.prototxt

network.prototxt

trained.caffemodel(BINARYPROTO)

TensorRT

network

description

(deploy)

deploy.protoxt

CUDA

Engine

in-memory(serializable)

26


// Add a module for color conversion.

// Color conversion module outputs:

// 0: BGR_PLANAR

// 1: NV12 (YCbCr)

IModule* pCC = pDW->addColorSpaceConvertorTask(BGR_PLANAR);

// Add an inference module using output 0 from color conversion

std::vector<std::pair<IModule*,int>> inf_inputs{std::make_pair(pCC, 0)};

IModule* pInf = pDW->addInferenceTask(inf_inputs,

prototxt_file.c_str(),

model_file.c_str(),

input_layer_name,

output_layer_names_vector,

num_channels,

&inference_params);

Step 4: Add an inference module

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

compressed video

bitstream

decoded videoframe

BGRimage

inferenceresults

(YCbCr image)

cat

dog

tree

0.9

0.05

0.05

Image

ConversionInference

Video

Decode

1 0 1 1 0…

27


// Add a parser module using output 0 from inference

std::vector<std::pair<IModule*,int>> parser_inputs;

parser_inputs.push_back(std::make_pair(pInf, 0));

IModule* pParse = new UserInferenceParser(parser_inputs);

pDW->addCustomerTask(pParse);

// Add a display module using image and parser outputs

std::vector<std::pair<IModule*,int>> display_inputs;

display_inputs.push_back(std::make_pair(pCC, 0));

display_inputs.push_back(std::make_pair(pParse, 0));

IModule* pDisp = new UserDisplayModule(display_inputs);

pDW->addCustomerTask(pDisp);

Step 5: Add custom output processor(s)

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

compressed video

bitstream

decoded videoframe

BGRimage

inferenceresults

(YCbCr image)

Image

ConversionInference Parser

Video

Decode

1 0 1 1 0…

Display /

Storage

28


// Add a display module using image conversion and parser outputs

std::vector<std::pair<IModule*,int>> display_inputs;

display_inputs.push_back(std::make_pair(pCC, 0)); // BGR output

display_inputs.push_back(std::make_pair(pParse, 0));

IModule* pDisp = new UserDisplayModule(display_inputs);

pDW->addCustomerTask(pDisp);

// Start the device worker...

pDW->start();

Step 6: Start the device worker

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

29


// Start the device worker...

pDW->start();

// Initiate data input by pushing data to the device worker

while(1)

{

// Read data from file/socket/database...

pDW->pushPacket(data, num_bytes, channel_num);

}

Step 7: Initiate data input

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

30

SCALING WITH DEEPSTREAM

31

DECODE PERFORMANCE ON NVIDIA GPUS

32

Scaling Video Analytics With DeepStream: Multiple Channels

GPU 0DeepStream runtime library (libdeepstream)

Display /

Storage

Image Conversion InferenceVideo Decode




… … … ……

channel 0

channel 1

channel 2

channel 3

33

Faster Inference with INT8 Integers

GPUs with Compute Capability 6.1 and higher support dot product instructions with 8-bit operands. (Examples of GPUs with this capability are Tesla P4 and P40.)

34

Inference with INT8: Accuracy

From: “8-bit Inference with TensorRT,” Szymon Migacz, NVIDIA, GTC2017.

35

Inference with INT8: Performance

From: “8-bit Inference with TensorRT,” Szymon Migacz, NVIDIA, GTC2017.

36

Using INT8 Inference in DeepStreamStep 1: Run TensorRT offline to generate a calibration table file

trained

weights

trained.caffemodel(BINARYPROTO)

TensorRT

network

description

(deploy)

deploy.protoxt

CUDA

Engine

calibration

table

37

Using int8 Inference in DeepStreamStep 2: Provide the path to the calibration file when creating the

inference task

// Add an inference module using output 0 from color conversion

std::vector<std::pair<IModule*,int>> inf_inputs{std::make_pair(pCC, 0)};

inference_params.calibrationTableFile = calibration_file.c_str();

IModule* pInf = pDW->addInferenceTask(inf_inputs,

prototxt_file.c_str(),

model_file.c_str(),

input_layer_name,

output_layer_names_vector,

num_channels,

&inference_params);

38

GPU 1

Scaling Video Analytics With DeepStream: Multiple GPUs

GPU 0DeepStream device worker (device ID = 0)

Display /

Storage



… … … ……

channel 0

channel 1

…

DeepStream device worker (device ID = 1)



… … … ……

channel 0

channel 1

39

Examples (Videos)

40

DeepStream Resources• DeepStream SDK Download

https://developer.nvidia.com/deepstream-sdk

• NVIDIA Parallel For All Blog:

DeepStream: Next Generation Video Analytics for Smart Citieshttps://devblogs.nvidia.com/parallelforall/deepstream-video-analytics-smart-cities/

• NVIDIA Developer Forums

https://devtalk.nvidia.com/

• Caffe Model Zoo

https://github.com/BVLC/caffe/wiki/Model-Zoo

• NVIDIA Video Encode and Decode GPU Support Matrix

https://developer.nvidia.com/video-encode-decode-gpu-support-matrix

https://developer.nvidia.com/deepstream-sdk

https://devblogs.nvidia.com/parallelforall/deepstream-video-analytics-smart-cities/

https://devtalk.nvidia.com/

https://github.com/BVLC/caffe/wiki/Model-Zoo

https://developer.nvidia.com/video-encode-decode-gpu-support-matrix

Fundamentals

Parallel Computing

Game Development & Digital Content

Finance

NVIDIA DEEP LEARNING INSTITUTE

Training available as online self-paced labs and instructor-led workshops

Take self-paced labs at www.nvidia.com/dlilabs

Find or request an instructor-led workshop at www.nvidia.com/dli

Educators can download the Teaching Kit at developer.nvidia.com/teaching-kit and contact [email protected] for info on the University Ambassador Program

Intelligent Video Analytics

Healthcare

Robotics

Autonomous Vehicles

Virtual Reality

DeepStream SDK: Towards Large Scale Deployment of...

Documents

Transcript of DeepStream SDK: Towards Large Scale Deployment of...