May 2017 – Chris Gottbrath

S7458 DEPLOYING UNIQUE DL NETWORKS AS MICRO-SERVICES WITH TENSORRT, USER EXTENSIBLE LAYERS, AND GPU REST ENGINE

2

AGENDA

Inference

TensorRT

Custom Layer API

GPU REST Engine

Conclusion

3

NEW AI SERVICES POSSIBLE WITH GPU CLOUD

SPOTIFY: SONG RECOMMENDATIONS

NETFLIX: VIDEO RECOMMENDATIONS

YELP: SELECTING COVER PHOTOS

4

DL INFERENCE POWERED API

Step-by-step

1. Gather Data

2. Train Your Model

3. Load Your Trained Model into TensorRT

4. Extend TensorRT with your Custom Layer(s) (if necessary)

5. Optimize using TensorRT

6. Deploy using GRE

5

WHAT IS TensorRT

6

NVIDIA TensorRT

High-performance deep learning inference for production deployment

High performance neural network inference optimizer and runtime engine for production deployment

Maximize inference throughput for latency-critical services

Trained Neural Network -> TensorRT Optimizer -> TensorRT Runtime Engine

[Figure: bar chart of images/second (axis 0 to 1,600). Bars as labeled on the slide: Caffe on 28 core E5-2690v4 x2 (b=121.3 ms); Caffe on K80 (b=125 ms); TensorRT on K80 (b=110 ms); TensorRT on P100 FP32 (b=25.4 ms); TensorRT on P100 FP16 (b=97 ms). Callout: Up To 30x More Images/sec vs. CPU-Only Inference.]

GoogLeNet, Tesla P100 + TensorRT (FP16), Tesla K80 + TensorRT (FP32), CPU-Only + Caffe (FP32)

CPU: 1 Socket Broadwell E5-2690 v4 with HT off

developer.nvidia.com/tensorrt

7

TensorRT Development Workflow

Training Framework -> NEURAL NETWORK -> Optimize using TensorRT -> PLAN -> Validate using TensorRT

Optimization inputs: Batch Size, Precision

Serialize to disk

developer.nvidia.com/tensorrt

8

TensorRT Production Workflow

Serialized PLAN -> Infer using TensorRT


9

IMPORT MODEL

10

TO IMPORT A TRAINED MODEL TO TensorRT

From Caffe

// Create the builder and an empty network definition.
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

// Parse the Caffe model into the network.
CaffeParser parser;
auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);

// Tell TensorRT which tensor is the network output.
network->markOutput(*blob_name_to_tensor->find(<output layer name>));

// Set the build limits, then build the optimized engine.
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);

This assumes you have a Caffemodel file.

Future: We are looking at a streamlined graph input for TensorFlow and other frameworks too!

11

IMPORTING USING THE GRAPH DEFINITION API

From any framework

If you use another framework, such as TensorFlow, you can call our network builder API directly:

ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);


12

CUSTOM LAYER API

13

CUSTOM LAYER API

Allows users to express and provide implementations of novel layers

• TensorRT provides APIs and implementations for most common layers

• Use the Custom Layer API for infrequent or more innovative layers

• Register custom implementations via a callback mechanism

• Can be used in conjunction with reduced precision optimizations

[Diagram: Application, TensorRT, Custom Layer, CUDA Runtime]

14

CUSTOM LAYER API

Member functions of IPlugin objects

Specify and build the network

• getNbOutputs()

• getOutputDimensions()

• configure()

• getWorkspaceSize()

Serialize the network

• getSerializationSize()

• serialize()

Runtime

• initialize()

• enqueue()

• terminate()
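To make the three phases above concrete, here is a minimal host-only sketch of the order in which those member functions might be exercised. `PluginSketch` and `RunLifecycle` are hypothetical stand-ins: they reuse the method names from the slide, but the signatures are simplified for illustration and are not those of the real IPlugin class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in that mirrors the method names on the slide.
// Signatures are simplified and are NOT those of the real IPlugin class.
struct PluginSketch {
    std::vector<std::string> calls;  // records the order in which methods run
    int channels = 0;

    // --- Specify and build the network ---
    int getNbOutputs() { calls.push_back("getNbOutputs"); return 1; }
    int getOutputDimensions(int input_size) { calls.push_back("getOutputDimensions"); return input_size; }
    void configure(int input_size) { calls.push_back("configure"); channels = input_size; }
    std::size_t getWorkspaceSize() { calls.push_back("getWorkspaceSize"); return 0; }

    // --- Serialize the network ---
    std::size_t getSerializationSize() { calls.push_back("getSerializationSize"); return sizeof(channels); }
    void serialize(void* buffer) { calls.push_back("serialize"); std::memcpy(buffer, &channels, sizeof(channels)); }

    // --- Runtime ---
    int initialize() { calls.push_back("initialize"); return 0; }
    // Toy layer: clamps negatives to zero on the host.
    // A real plugin would launch GPU work on a CUDA stream here.
    void enqueue(const float* in, float* out, int n) {
        calls.push_back("enqueue");
        for (int i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
    void terminate() { calls.push_back("terminate"); }
};

// Drives the plugin through the three phases in the order listed on the slide.
std::vector<float> RunLifecycle(PluginSketch& plugin, const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    plugin.getNbOutputs();
    const int out_size = plugin.getOutputDimensions(n);
    plugin.configure(n);
    plugin.getWorkspaceSize();

    std::vector<char> blob(plugin.getSerializationSize());
    plugin.serialize(blob.data());

    std::vector<float> output(out_size);
    plugin.initialize();
    plugin.enqueue(input.data(), output.data(), n);
    plugin.terminate();
    return output;
}
```

Calling `RunLifecycle(plugin, {-1.0f, 2.0f, -3.0f})` leaves the call order in `plugin.calls`, which is a quick way to see which phase each method belongs to.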

15

INTEGRATING PLUGINS AND CUSTOM LAYERS

When building the network

• Directly register using the API

• Plugin factory in Caffe parser

Runtime

• Runtime factory object

Two samples provided

• Simple MNIST sample: replaces the fully connected layer with a GEMM

• Faster R-CNN: we provide a library with implementations of the needed reshape layers and ROIPooling

16

OPTIMIZE

17

TensorRT Optimizations

• Fuse network layers

• Eliminate concatenation layers

• Apply reduced precision

• Specialized kernels

• Autotune for target platform

• Tune for given batch size

TRAINED NEURAL NETWORK -> OPTIMIZED INFERENCE RUNTIME


18

BUILDING THE OPTIMIZED ENGINE

API calls

In General

IBuilder* builder = createInferBuilder(gLogger);
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(64 << 20);  // 64 MiB of scratch space
auto engine = builder->buildCudaEngine(*network);

INT8 (set these before calling buildCudaEngine)

builder->setInt8Mode(true);
IInt8Calibrator* calibrator = <calibrator>;
builder->setInt8Calibrator(calibrator);

See the slides from session S7310 on INT8 quantization for details!

19

EXECUTE THE NEURAL NETWORK

Running inference using the API

IExecutionContext* context = engine->createExecutionContext();

<index> = engine->getBindingIndex(<binding layer name>);

// Allocate buffers for data moving in and out,
// using the index from the getBindingIndex() call above.
<malloc and cudaMalloc calls>

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(<args>);  // Copy input data to the GPU
context->enqueue(<args>);
cudaMemcpyAsync(<args>);  // Copy output data to the host
cudaStreamSynchronize(stream);

20

LOW LATENCY THROUGHPUT

[Figure: throughput (images/s, axis 0 to 2,500) versus latency (ms, axis 0 to 50), with points labeled by batch size (1, 2, 4, 8, 16, 32, 64). Series: Caffe FP32 on CPU; TensorFlow FP32 on P100; TensorRT FP32 on P100; TensorRT FP16 on P100.]

ResNet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690v4 with 14 cores.

21


[Figure: the throughput (images/s) versus latency (ms) plot, points labeled by batch size, annotated "Same Batch Size".]

22


[Figure: the throughput (images/s) versus latency (ms) plot, points labeled by batch size, annotated "More and faster".]

23

DEPLOYING TensorRT AS A MICROSERVICE WITH GPU REST ENGINE (GRE)

24

GPU REST ENGINE (GRE) SDK

Accelerated microservices for web and mobile

Supercomputer performance for hyperscale datacenters

• Up to 50 teraflops per node, min ~250μs response time

Easy to develop new microservices

• Open source, integrates with existing infrastructure

Easy to deploy & scale

• Ready-to-run Dockerfile

[Diagram: HTTP (~250μs) requests into the GPU REST Engine, serving Image Classification, Speech Recognition, Image Scaling, …]

developer.nvidia.com/gre



25

RESOURCE POOL

[Diagram: a Context Pool holds Contexts for GPU1 and GPU2; each request borrows one through a Request ScopedContext.]
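A pool like the one in this diagram can be sketched in a few lines of standard C++. This is a minimal illustration under stated assumptions, not the GRE source: the `ContextPool` and `ScopedContext` names follow the slides, but their members (`Push`, `Pop`, `Size`) and the `DemoContext` type are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Thread-safe pool of reusable contexts (illustrative sketch only).
template <typename Context>
class ContextPool {
public:
    // Add a context to the pool and wake one waiting request.
    void Push(std::unique_ptr<Context> context) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            contexts_.push(std::move(context));
        }
        available_.notify_one();
    }

    // Block until a context is free, then hand ownership to the caller.
    std::unique_ptr<Context> Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !contexts_.empty(); });
        std::unique_ptr<Context> context = std::move(contexts_.front());
        contexts_.pop();
        return context;
    }

    std::size_t Size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return contexts_.size();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::queue<std::unique_ptr<Context>> contexts_;
};

// RAII handle: borrows a context for one request and returns it on scope exit.
template <typename Context>
class ScopedContext {
public:
    explicit ScopedContext(ContextPool<Context>& pool)
        : pool_(pool), context_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(context_)); }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
    Context* operator->() const { return context_.get(); }

private:
    ContextPool<Context>& pool_;
    std::unique_ptr<Context> context_;
};

// Placeholder context; a real one would own a CUDA stream, an execution context, and so on.
struct DemoContext {
    int device;
};
```

Because the handle returns its context in the destructor, a request handler cannot forget to release it, even on an early return or an exception.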




26

CLASSIFICATION MICROSERVICE

[Diagram: the Client calls the REST API of the Microservice. HTTP layer (Go, host CPU): func TensorRTInference. App layer (C++, host CPU): classifier_classify(), using ScopedContext<>. Device layer (CUDA device, GPU): classify().]




27

CLASSIFICATION.CPP (1/2)

// Several contexts per device, to allow latency hiding.
constexpr static int kContextsPerDevice = 2;

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
    try {
        int device_count;
        cudaError_t st = cudaGetDeviceCount(&device_count);
        ContextPool<ExecContext> pool;
        for (int dev = 0; dev < device_count; ++dev) {
            // One engine per GPU.
            std::shared_ptr<InferenceEngine> engine(new InferenceEngine(model_file, trained_file));
            for (int i = 0; i < kContextsPerDevice; ++i) {
                // One ExecContext per pooled context.
                std::unique_ptr<ExecContext> context(new ExecContext(engine, mean_file, label_file, dev));
                pool.Push(std::move(context));
            }
        }
    } catch (...) { ... }
}



28

CLASSIFICATION.CPP (2/2)

const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
{
    try {
        {
            // Uses a scoped context: borrowed from the pool here, returned at the closing brace.
            ScopedContext<ExecContext> context(ctx->pool);
            auto classifier = context->TensorRTClassifier();
            // Lower level classify routine.
            predictions = classifier->Classify(img);
        }

        /* Write the top N predictions in JSON format. */
    }
    // The catch block and the decoding of buffer into img are not shown on the slide.
}




29

CONCLUSION

30

CONCLUSION

Inference powers an increasing number of features and capabilities

Cutting edge networks including custom layers can be deployed in TensorRT

TensorRT leverages GPU power to deliver throughput and low latency

GRE is a template to follow for creating accelerated microservices




31

WANT TO LEARN MORE?

Resources to check out

developer.nvidia.com/gre (just updated with the code shown today!)

devblogs.nvidia.com/parallelforall/

• NVIDIA Jetson TX2 Delivers Twice …

• Production Deep Learning …

www.nvidia.com/en-us/deep-learning-ai/education/

github.com/dusty-nv/jetson-inference

Here at GTC

• S7310 8-bit Inference with TensorRT: Monday morning, but check web for slides and recording

• H7126 Deep Learning Inference with TensorRT: Wed 3PM, Lower Level Pod B

• L7123 Neural Network Deployment with DIGITS and TensorRT: Wed Noon, LL21E




[email protected]

THANKS

35

RESOURCE SLIDES

37

INT8 INFERENCE

Challenge

• Main challenge

  • INT8 has significantly lower precision and dynamic range compared to FP32

  • Requires “smart” quantization and calibration from FP32 to INT8

        Dynamic Range                    Min Pos Value
FP32    -3.4 x 10^38 ~ +3.4 x 10^38      1.4 x 10^-45
FP16    -65504 ~ +65504                  5.96 x 10^-8
INT8    -128 ~ +127                      1
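The figures in this table can be reproduced from the formats themselves. The sketch below is illustrative only: `Range` and the three helper functions are hypothetical names, and the FP16 values are computed from the IEEE 754 half-precision layout (10 fraction bits, maximum exponent 15, smallest subnormal 2^-24) because standard C++ has no built-in half type.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Largest finite value and smallest positive value of a numeric format.
struct Range {
    double max_value;
    double min_positive;
};

// FP32: taken directly from the standard library.
Range Fp32Range() {
    return {static_cast<double>(std::numeric_limits<float>::max()),
            static_cast<double>(std::numeric_limits<float>::denorm_min())};
}

// FP16: (2 - 2^-10) * 2^15 = 65504, and the smallest subnormal is 2^-24.
Range Fp16Range() {
    return {(2.0 - std::ldexp(1.0, -10)) * std::ldexp(1.0, 15),
            std::ldexp(1.0, -24)};
}

// INT8: integers only, so the smallest positive value is 1.
Range Int8Range() {
    return {static_cast<double>(std::numeric_limits<int8_t>::max()), 1.0};
}
```

The gap between a smallest step of 1 for INT8 and roughly 1.4 x 10^-45 for FP32 is what makes the choice of scaling factor so important.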


38

QUANTIZATION OF WEIGHTS

Symmetric, Linear Quantization

I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )

scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )

[Figure: INT8 number line from -127 to 127; quantized weights fall in [-127, 127].]
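The two formulas above translate almost directly into code. This is a minimal sketch for a single filter; the function name `QuantizeWeights` and the handling of an all-zero filter are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric, linear quantization of one filter's FP32 weights to INT8.
// Returns the INT8 weights and writes the scaling factor to *scaling_factor_out.
std::vector<int8_t> QuantizeWeights(const std::vector<float>& f32_weights,
                                    float* scaling_factor_out) {
    assert(scaling_factor_out != nullptr);

    // max(abs(all_F32_weights_in_the_filter))
    float max_abs = 0.0f;
    for (float w : f32_weights) max_abs = std::max(max_abs, std::fabs(w));

    // Assumption: an all-zero filter gets scale 1 to avoid dividing by zero.
    const float scaling_factor = (max_abs > 0.0f) ? 127.0f / max_abs : 1.0f;

    std::vector<int8_t> i8_weights;
    i8_weights.reserve(f32_weights.size());
    for (float w : f32_weights) {
        // Round_to_nearest_int(scaling_factor * F32_weight); the result lies in [-127, 127].
        i8_weights.push_back(static_cast<int8_t>(std::lround(scaling_factor * w)));
    }
    *scaling_factor_out = scaling_factor;
    return i8_weights;
}
```

Because the scale is derived from the largest magnitude in the filter, that weight always maps to exactly 127 or -127 and nothing saturates.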

39

QUANTIZATION OF ACTIVATIONS

I8_value = (value > threshold) ? threshold : scale * F32_value

How do you decide the optimal ‘threshold’?

• Activation range is unknown offline, input dependent

• Calibration using a ‘representative’ dataset
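One way to read the saturating rule above is sketched below. Three things here are assumptions rather than statements from the slide: the range is treated as symmetric (values below -threshold saturate too), the scale is taken as 127 / threshold, and the function name `QuantizeActivation` is made up for the example. Choosing the threshold itself is the job of calibration and is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Saturating quantization of one FP32 activation to INT8.
// Assumptions (not on the slide): threshold > 0, symmetric range,
// and scale = 127 / threshold.
int8_t QuantizeActivation(float f32_value, float threshold) {
    assert(threshold > 0.0f);
    const float scale = 127.0f / threshold;
    // Saturate: anything beyond the threshold maps to the end of the INT8 range.
    const float clamped = std::max(-threshold, std::min(f32_value, threshold));
    return static_cast<int8_t>(std::lround(scale * clamped));
}
```

A smaller threshold gives finer resolution for typical activations at the cost of clipping more outliers, which is why it is tuned on a representative dataset.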


41

TensorRT INT8 Workflow

FP32 Training Framework -> FP32 NEURAL NETWORK -> INT8 OPTIMIZATION USING TensorRT -> INT8 PLAN -> INT8 RUNTIME USING TensorRT

Optimization inputs: Calibration Dataset, Batch Size, Precision

42

8-BIT INFERENCE

Top-1 Accuracy

Network | FP32 Top1 | INT8 Top1 | Difference | Perf Gain


43

main.go

Call chain: func EmptyKernel_Handler -> benchmark_execute() -> kernel_wrapper() -> empty_kernel<<<>>>

func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
    // Calls the C func
    C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
    io.WriteString(w, string(message[:]))
}

func main() {
    // Set API URL
    http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)
    // Execute server
    http.ListenAndServe(":8000", nil)
}

44

benchmark.cpp (1/2)

constexpr static int kContextsPerDevice = 4;  // 4 per GPU

benchmark_ctx* benchmark_initialize()
{
    // Get # GPUs
    int device_count;
    cudaGetDeviceCount(&device_count);

    // Create pool
    ContextPool<BenchmarkContext> pool;
    for (int dev = 0; dev < device_count; ++dev) {
        for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
        }
    }
}


45

benchmark.cpp (2/2)

void benchmark_execute(benchmark_ctx* ctx, char* message)
{
    // Scoped Context
    ScopedContext<BenchmarkContext> context(ctx->pool);
    cudaStream_t stream = context->CUDAStream();
    // Run the wrapper
    kernel_wrapper(stream, message);
}

46

kernel.cu

// GPU code
__global__ void empty_kernel(char* device_message)
{
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break;
    }
}

// Host side wrapper
void kernel_wrapper(cudaStream_t stream, char* message)
{
    cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);

    // Device call
    empty_kernel<<<1, 1, 0, stream>>>(device_message);

    cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}

47

TensorRT

Layer Types Supported

• Convolution: currently only 2D convolutions

• Activation: ReLU, tanh and sigmoid

• Pooling: max and average

• Scale: similar to the Caffe Power layer, (shift + scale*x)^p

• ElementWise: sum, product or max of two tensors

• LRN: cross-channel only

• Fully-connected: with or without bias

• SoftMax: cross-channel only

• Deconvolution
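As a small worked example of the Scale entry in this list, the expression (shift + scale*x)^p can be applied element-wise as follows. This is a sketch under a simplifying assumption: a single shift, scale, and power for the whole tensor. The function name `ScaleLayer` is hypothetical and is not a TensorRT call.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Applies y = (shift + scale * x)^power to each element.
// Assumption: one shift/scale/power value shared by the whole tensor.
std::vector<float> ScaleLayer(const std::vector<float>& x,
                              float shift, float scale, float power) {
    std::vector<float> y;
    y.reserve(x.size());
    for (float v : x) y.push_back(std::pow(shift + scale * v, power));
    return y;
}
```

With shift = 1, scale = 2, and power = 2, the inputs 1, 2, 3 become 3^2, 5^2, 7^2.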

48

GRAPH OPTIMIZATION

Unoptimized network

[Diagram: between two concat nodes (input and next input) sit a max pool and six convolutions (four 1x1, one 3x3, one 5x5), each followed by separate bias and relu nodes.]

49

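GRAPH OPTIMIZATION

Vertical fusion

[Diagram: each conv + bias + relu chain is fused into a single CBR node. One row holds 1x1 CBR, 3x3 CBR, 5x5 CBR, 1x1 CBR; another holds 1x1 CBR, 1x1 CBR. The concat, max pool, input, and next input nodes are unchanged.]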
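The effect of vertical fusion can be illustrated on the host with a 1x1 convolution at a single pixel, which reduces to a matrix-vector product over channels. This is only a sketch of the idea, reading CBR as convolution + bias + ReLU; the function names are hypothetical and this is not how TensorRT implements its fused kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Unfused: three separate passes, each producing an intermediate buffer.
Vec Conv1x1(const Vec& in, const std::vector<Vec>& weights) {
    Vec out(weights.size(), 0.0f);
    for (std::size_t o = 0; o < weights.size(); ++o)
        for (std::size_t c = 0; c < in.size(); ++c) out[o] += weights[o][c] * in[c];
    return out;
}

Vec AddBias(Vec x, const Vec& bias) {
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += bias[i];
    return x;
}

Vec Relu(Vec x) {
    for (float& v : x) v = std::max(v, 0.0f);
    return x;
}

// Fused CBR: convolution, bias, and ReLU in one pass with no intermediates.
Vec FusedCbr1x1(const Vec& in, const std::vector<Vec>& weights, const Vec& bias) {
    Vec out(weights.size());
    for (std::size_t o = 0; o < weights.size(); ++o) {
        float acc = bias[o];
        for (std::size_t c = 0; c < in.size(); ++c) acc += weights[o][c] * in[c];
        out[o] = std::max(acc, 0.0f);
    }
    return out;
}
```

Both paths compute the same values; the fused version simply avoids writing and re-reading the intermediate tensors, which on a GPU also means fewer kernel launches.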

50

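GRAPH OPTIMIZATION

Horizontal fusion

[Diagram: the 1x1 CBR nodes that read the same input are merged into one wider 1x1 CBR. Remaining nodes: 3x3 CBR, 5x5 CBR, 1x1 CBR, max pool, and both concat nodes (input and next input).]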

51

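GRAPH OPTIMIZATION

Concat elision

[Diagram: the concat nodes are removed; layer outputs are written directly to their final destination. Remaining nodes: input, 1x1 CBR, 3x3 CBR, 5x5 CBR, max pool, 1x1 CBR, next input.]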

52

IDP.4A – 8 BIT INSTRUCTION

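[Diagram: four i8 values are multiplied element-wise by four i8 values; the four products are summed and added to an i32 accumulator, giving an i32 result.]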
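The arithmetic performed by this instruction can be emulated on the host as shown below. This is only a reference sketch of the math; the function name `Dp4aEmulated` is hypothetical. In CUDA C++ the hardware instruction is exposed through the `__dp4a` intrinsic on GPUs that support it, which takes the four 8-bit values packed into 32-bit integers.

```cpp
#include <cassert>
#include <cstdint>

// Host-side emulation of a 4-way INT8 dot product with 32-bit accumulate:
// result = c + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].
int32_t Dp4aEmulated(const int8_t a[4], const int8_t b[4], int32_t c) {
    int32_t acc = c;
    for (int i = 0; i < 4; ++i) {
        // Widen to 32 bits before multiplying so the products cannot overflow.
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
    return acc;
}
```

Accumulating in 32 bits is what lets long INT8 dot products, such as those inside a convolution, run without overflowing the 8-bit range.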

53

SMALLER AND FASTER

[Figure: two bar charts comparing FP32, FP16 on P100, and INT8 on P40. Performance: images/s scaled to FP32 (axis 0 to 3.5). Memory Usage: % scaled to FP32 (axis 0 to 120).]

ResNet50 Model, Batch Size = 128, TensorRT 2.1 RC prerelease


54

INT8 Precision

New in TensorRT

PERFORMANCE: Up To 3x More Images/sec with INT8 Precision

[Figure: images/second (axis 0 to 7,000) at batch sizes 2, 4, 128; FP32 vs INT8.]

EFFICIENCY: Deploy 2x Larger Models with INT8 Precision

[Figure: memory (MB, axis 0 to 1,400) at batch sizes 2, 4, 128; FP32 vs INT8.]

ACCURACY: Deliver full accuracy with INT8 precision

[Figure: % accuracy (axis 0% to 100%) for Top 1 Accuracy and Top 5 Accuracy; FP32 vs INT8.]

GoogLeNet, FP32 vs INT8 precision + TensorRT on Tesla P40 GPU, 2 Socket Haswell E5-2698 v3 with HT off

55

THROUGHPUT

[Figure: images/s (axis 0 to 2,500) versus batch size (1, 2, 4, 8, 16, 32, 64, 128).]


56

LATENCY

[Figure: latency (ms, logarithmic axis 1 to 10,000) versus batch size (1, 2, 4, 8, 16, 32, 64, 128).]


58

NVIDIA TensorRT

High-performance deep learning inference for production deployment

High performance neural network inference engine for production deployment

Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms

Deliver high-performance, low-latency inference demanded by real-time services

Deploy faster, more responsive and memory efficient deep learning applications with INT8 and FP16 optimized precision support

[Figure: images/second (axis 0 to 7,000) at batch sizes 2, 8, 128. Series: CPU-Only; Tesla P40 + TensorRT (FP32); Tesla P40 + TensorRT (INT8). Callout: Up to 36x More Images/sec.]

GoogLeNet, CPU-only vs Tesla P40 + TensorRT

CPU: 1 socket E5-2690 v4 @2.6 GHz, HT on

GPU: 2 socket E5-2698 v3 @2.3 GHz, HT off, 1 P40 card in the box
