DeepStream SDK: Towards Large Scale Deployment of...
Transcript of DeepStream SDK: Towards Large Scale Deployment of...
November 1, 2017
DeepStream SDK:Towards Large Scale Deployment of Intelligent Video Analytics Systems
Jeremy Furtek
2
MOTIVATION
3
Proliferation of Video Data
Content Filtering
Retail Analytics
Traffic Engineering
Law Enforcement
Ad Injection
Securing Critical Infrastructure
Parking Management
In-Vehicle Analytics
4
Advances in Deep Learning: Classification and Beyond
Image Classification(AlexNet, 2012)
Object Detection(R-FCN, 2016)
Semantic Image Segmentation(ENet, 2016)
Image Compression(WaveOne, 2017)
Structure From Motion(SfM-Net, 2017)
Pose-Aware Face Recognition(USC, 2017)
5
WHAT IS DEEPSTREAM?
HOW IT WORKS
DEEP LEARNING
HOW IT WORKS
VIDEO
DEEP LEARNING WITH DEEPSTREAM
DeepStream EnabledApp or Service
VIDEO ANALYTICS WITH DEEP LEARNING ON NVIDIA GPUS: AN OVERVIEW
9
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
(YCbCr image)
Video
Decode
1 0 1 1 0…
10
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
(YCbCr image)
Video
Decode
1 0 1 1 0…
Video Decode
Fully-accelerated hardware decoding on NVIDIAGPUs (NVDEC) is accessible via the Video DecodeAPI.
11
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
BGRimage
(YCbCr image)
Image
Conversion
Video
Decode
1 0 1 1 0…
12
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
BGRimage
(YCbCr image)
Image
Conversion
Video
Decode
1 0 1 1 0…
Image Conversion
Custom CUDA kernels and/or optimized libraries (e.g. NVIDIA Performance Primitives, or NPP) enable high performance image conversions.
__global__ void resize(uint8_t* dst,
const uint8_t* src,
int dst_w,
int dst_h,
int src_w,
int src_h)
{
…
}
13
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
BGRimage
(YCbCr image)
Image
Conversion
Video
Decode
1 0 1 1 0…
inferenceresults
cat
dog
tree
0.9
0.05
0.05
Inference
14
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
BGRimage
(YCbCr image)
Image
Conversion
Video
Decode
1 0 1 1 0…
inferenceresults
cat
dog
tree
0.9
0.05
0.05
Inference
InferenceTensorRT* provides a neural network optimizer and a high performance runtime library for inference deployment on NVIDIA GPUs.
Step 1: Optimize a trained neural network Step 2: Perform real-time inference with the TensorRT runtime
*The current DeepStream release supports TensorRT 2.1.
15
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
BGRimage
inferenceresults
(YCbCr image)
cat
dog
tree
0.9
0.05
0.05
Image
ConversionInference
Display /
StorageVideo
Decode
1 0 1 1 0…
16
From Video to Analytics With Deep Learning…
compressed video
bitstream
decoded videoframe
BGRimage
inferenceresults
(YCbCr image)
cat
dog
tree
0.9
0.05
0.05
Image
ConversionInference
Display /
StorageVideo
Decode
1 0 1 1 0…
Display/StorageProcessing of inference output is application dependent.
Some common options are:
• Real-time display with overlay• Alert generation• Database insertion• Augmented video encoding• “Slew-to-Cue” camera control• Tracker update• Custom post-processing algorithms (CUDA, OpenCV, …)
17
Video to Analytics With the DeepStream SDK
DeepStream runtime library (libdeepstream)
compressed video
bitstream
decoded videoframe
BGRimage
inferenceresults
(YCbCr image)
cat
dog
tree
0.9
0.05
0.05
Image
ConversionInference
Display /
StorageVideo
Decode
1 0 1 1 0…
18
USING THE DEEPSTREAM LIBRARY
19
DeepStream from the Ground Up
DeepStream tensors are objects that implement the IStreamTensor interface and represent a 4-dimensional array of values stored in NCHW order.
Tensors
compressed video
bitstream
W
H
C
N
typedef enum {
FLOAT_TENSOR, // float32
NV12_FRAME, // decoded YCbCr
OBJ_COORD, // detection bounding box
CUSTOMER_TYPE // user defined
} TENSOR_TYPE;
typedef enum {
GPU_DATA, // device memory
CPU_DATA, // host memory
} MEMORY_TYPE;
IStreamTensor* createStreamTensor(int maxlen,
size_t elemsize,
TENSOR_TYPE Ttype,
MEMORY_TYPE Mtype,
int deviceID);
20
DeepStream from the Ground Up
DeepStream modules are objects that implement the IModule interface and operate on one or more input tensors to produce one or more output tensors.
Modules
void IModule::execute(const ModuleContext& ctx,
const std::vector<IStreamTensor*>& in,
const std::vector<IStreamTensor*>& out) = 0;
DeepStream
Module
input[0]
input[1]
input[M -1]
…
output[0]
output[1]
output[N -1]
…
21
DeepStream from the Ground Up
DeepStream modules maintain connectivity information using a std::vector of <IModule*, int> pairs that represent the inputs to the module.
Building the DeepStream dataflow graph
Module CModule A A.output[0]
A.output[1]
Module B
output[0]
B.output[0]
B.output[1]
C.input[0]
C.input[1]
IModule* index
&A 1
&B 0
Inputs:
22
Using the DeepStream Library
// Create a device worker
IDeviceWorker* pDW = createDeviceWorker(1, // num channels
0); // GPU device ID
Step 1: Create a device worker object
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
23
Using the DeepStream Library
// Create a device worker
IDeviceWorker* pDW = createDeviceWorker(1, 0);
// Add a decode task, specifying the CODEC type
pDW->addDecodeTask(cudaVideoCodec_H264);
Step 2: Add a decode task
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
compressed video
bitstream
decoded videoframe
(YCbCr image)
Video
Decode
1 0 1 1 0…
24
Using the DeepStream Library
// Create a device worker
IDeviceWorker* pDW = createDeviceWorker(1, 0);
// Add a decode task, specifying the CODEC type
pDW->addDecodeTask(cudaVideoCodec_H264);
// Add a module for color conversion.
// Color conversion module outputs:
// 0: BGR_PLANAR
// 1: NV12 (YCbCr)
IModule* pCC = pDW->addColorSpaceConvertorTask(BGR_PLANAR);
Step 3: Add an image conversion module
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
compressed video
bitstream
decoded videoframe
BGRimage
(YCbCr image)
Image
Conversion
Video
Decode
1 0 1 1 0…
Caffe uses BGR ordering(not RGB)
25
Inference in DeepStream: CAFFE and TensorRT
solver
parameters
network
description
(training)
CAFFE
caffe train –solver solver.prototxt
trained
weightssolver.prototxt
network.prototxt
trained.caffemodel(BINARYPROTO)
TensorRT
network
description
(deploy)
deploy.protoxt
CUDA
Engine
in-memory(serializable)
26
Using the DeepStream Library
// Add a module for color conversion.
// Color conversion module outputs:
// 0: BGR_PLANAR
// 1: NV12 (YCbCr)
IModule* pCC = pDW->addColorSpaceConvertorTask(BGR_PLANAR);
// Add an inference module using output 0 from color conversion
std::vector<std::pair<IModule*,int>> inf_inputs{std::make_pair(pCC, 0)};
IModule* pInf = pDW->addInferenceTask(inf_inputs,
prototxt_file.c_str(),
model_file.c_str(),
input_layer_name,
output_layer_names_vector,
num_channels,
&inference_params);
Step 4: Add an inference module
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
compressed video
bitstream
decoded videoframe
BGRimage
inferenceresults
(YCbCr image)
cat
dog
tree
0.9
0.05
0.05
Image
ConversionInference
Video
Decode
1 0 1 1 0…
27
Using the DeepStream Library
// Add a parser module using output 0 from inference
std::vector<std::pair<IModule*,int>> parser_inputs;
parser_inputs.push_back(std::make_pair(pInf, 0));
IModule* pParse = new UserInferenceParser(parser_inputs);
pDW->addCustomerTask(pParse);
// Add a display module using image and parser outputs
std::vector<std::pair<IModule*,int>> display_inputs;
display_inputs.push_back(std::make_pair(pCC, 0));
display_inputs.push_back(std::make_pair(pParse, 0));
IModule* pDisp = new UserDisplayModule(display_inputs);
pDW->addCustomerTask(pDisp);
Step 5: Add custom output processor(s)
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
compressed video
bitstream
decoded videoframe
BGRimage
inferenceresults
(YCbCr image)
Image
ConversionInference Parser
Video
Decode
1 0 1 1 0…
Display /
Storage
28
Using the DeepStream Library
// Add a display module using image conversion and parser outputs
std::vector<std::pair<IModule*,int>> display_inputs;
display_inputs.push_back(std::make_pair(pCC, 0)); // BGR output
display_inputs.push_back(std::make_pair(pParse, 0));
IModule* pDisp = new UserDisplayModule(display_inputs);
pDW->addCustomerTask(pDisp);
// Start the device worker...
pDW->start();
Step 6: Start the device worker
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
29
Using the DeepStream Library
// Start the device worker...
pDW->start();
// Initiate data input by pushing data to the device worker
while(1)
{
// Read data from file/socket/database...
pDW->pushPacket(data, num_bytes, channel_num);
}
Step 7: Initiate data input
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
30
SCALING WITH DEEPSTREAM
31
DECODE PERFORMANCE ON NVIDIA GPUS
32
Scaling Video Analytics With DeepStream: Multiple Channels
GPU 0DeepStream runtime library (libdeepstream)
Display /
Storage
Image Conversion InferenceVideo Decode
Image Conversion InferenceVideo Decode
Image Conversion InferenceVideo Decode
Image Conversion InferenceVideo Decode
… … … ……
channel 0
channel 1
channel 2
channel 3
33
Faster Inference with INT8 Integers
GPUs with Compute Capability 6.1 and higher support dot product instructions with 8-bit operands. (Examples of GPUs with this capability are Tesla P4 and P40.)
34
Inference with INT8: Accuracy
From: “8-bit Inference with TensorRT,” Szymon Migacz, NVIDIA, GTC2017.
35
Inference with INT8: Performance
From: “8-bit Inference with TensorRT,” Szymon Migacz, NVIDIA, GTC2017.
36
Using INT8 Inference in DeepStreamStep 1: Run TensorRT offline to generate a calibration table file
trained
weights
trained.caffemodel(BINARYPROTO)
TensorRT
network
description
(deploy)
deploy.protoxt
CUDA
Engine
calibration
table
37
Using int8 Inference in DeepStreamStep 2: Provide the path to the calibration file when creating the
inference task
// Add an inference module using output 0 from color conversion
std::vector<std::pair<IModule*,int>> inf_inputs{std::make_pair(pCC, 0)};
inference_params.calibrationTableFile = calibration_file.c_str();
IModule* pInf = pDW->addInferenceTask(inf_inputs,
prototxt_file.c_str(),
model_file.c_str(),
input_layer_name,
output_layer_names_vector,
num_channels,
&inference_params);
38
GPU 1
Scaling Video Analytics With DeepStream: Multiple GPUs
GPU 0DeepStream device worker (device ID = 0)
Display /
Storage
Image Conversion InferenceVideo Decode
Image Conversion InferenceVideo Decode
… … … ……
channel 0
channel 1
…
DeepStream device worker (device ID = 1)
Image Conversion InferenceVideo Decode
Image Conversion InferenceVideo Decode
… … … ……
channel 0
channel 1
39
Examples (Videos)
40
DeepStream Resources• DeepStream SDK Download
https://developer.nvidia.com/deepstream-sdk
• NVIDIA Parallel For All Blog:
DeepStream: Next Generation Video Analytics for Smart Citieshttps://devblogs.nvidia.com/parallelforall/deepstream-video-analytics-smart-cities/
• NVIDIA Developer Forums
https://devtalk.nvidia.com/
• Caffe Model Zoo
https://github.com/BVLC/caffe/wiki/Model-Zoo
• NVIDIA Video Encode and Decode GPU Support Matrix
https://developer.nvidia.com/video-encode-decode-gpu-support-matrix
Fundamentals
Parallel Computing
Game Development & Digital Content
Finance
NVIDIA DEEP LEARNING INSTITUTE
Training available as online self-paced labs and instructor-led workshops
Take self-paced labs at www.nvidia.com/dlilabs
Find or request an instructor-led workshop at www.nvidia.com/dli
Educators can download the Teaching Kit at developer.nvidia.com/teaching-kit and contact [email protected] for info on the University Ambassador Program
Intelligent Video Analytics
Healthcare
Robotics
Autonomous Vehicles
Virtual Reality