Transcript of GTC 2017 session S7458: DEPLOYING UNIQUE DL NETWORKS AS MICRO-SERVICES WITH TENSORRT, USER EXTENSIBLE LAYERS, AND GPU REST ENGINE

Page 1

May 2017 – Chris Gottbrath

S7458 DEPLOYING UNIQUE DL NETWORKS AS MICRO-SERVICES WITH TENSORRT, USER EXTENSIBLE LAYERS, AND GPU REST ENGINE

Page 2

AGENDA

Inference

TensorRT

Custom Layer API

GPU REST Engine

Conclusion

Page 3

NEW AI SERVICES POSSIBLE WITH GPU CLOUD

SPOTIFY: SONG RECOMMENDATIONS

NETFLIX: VIDEO RECOMMENDATIONS

YELP: SELECTING COVER PHOTOS

Page 4

DL INFERENCE POWERED API

1. Gather Data

2. Train Your Model

3. Load Your Trained Model into TensorRT

4. Extend TensorRT with your Custom Layer(s) (if necessary)

5. Optimize using TensorRT

6. Deploy using GRE

Step-by-step

Page 5

WHAT IS TensorRT

Page 6

NVIDIA TensorRT: High-performance deep learning inference for production deployment

developer.nvidia.com/tensorrt

[Chart: Up to 30x more images/sec vs. CPU-only inference — images/second for GoogLeNet. Bars: Caffe on 28-core E5-2690 v4 x2 (b=121.3 ms), Caffe on K80 (b=125 ms), TensorRT on K80 (b=110 ms), TensorRT on P100 FP32 (b=25.4 ms), TensorRT on P100 FP16 (b=97 ms). CPU: 1-socket Broadwell E5-2690 v4 @ 2.6 GHz with HT off.]

[Diagram: Trained Neural Network → TensorRT Optimizer → TensorRT Runtime Engine]

High-performance neural network inference optimizer and runtime engine for production deployment

Maximize inference throughput for latency-critical services

Page 7

TensorRT Development Workflow

[Diagram: Training Framework → NEURAL NETWORK → Optimize using TensorRT (choosing Batch Size and Precision) → Validate using TensorRT → PLAN → serialize to disk]

developer.nvidia.com/tensorrt

Page 8

TensorRT Production Workflow

[Diagram: Serialized PLAN → Infer using TensorRT]

developer.nvidia.com/tensorrt

Page 9

IMPORT MODEL

Page 10

TO IMPORT A TRAINED MODEL TO TensorRT

IBuilder* builder = createInferBuilder(gLogger);

INetworkDefinition* network = builder->createNetwork();

CaffeParser parser;

auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);

network->markOutput(*blob_name_to_tensor->find(<output layer name>));

builder->setMaxBatchSize(<size>);

builder->setMaxWorkspaceSize(<size>);

ICudaEngine* engine = builder->buildCudaEngine(*network);

From Caffe

This assumes you have a Caffemodel file

developer.nvidia.com/tensorrt

Future: We are looking at a streamlined graph input for TensorFlow and other frameworks too!
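The PLAN shown in the workflow slides is simply the serialized engine. A minimal sketch of that round trip, assuming the IHostMemory-based serialize() / deserializeCudaEngine() calls from TensorRT 2.x-era samples (the file name and stream handling are illustrative, not from the talk):

// Serialize the optimized engine to a PLAN file (needs <fstream>)
IHostMemory* plan = engine->serialize();
std::ofstream out("model.plan", std::ios::binary);   // hypothetical file name
out.write(static_cast<const char*>(plan->data()), plan->size());
plan->destroy();

// Later, in the production service: deserialize the PLAN and run it
IRuntime* runtime = createInferRuntime(gLogger);
std::ifstream in("model.plan", std::ios::binary);
std::string blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
ICudaEngine* deployedEngine = runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);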

Page 11

IMPORTING USING THE GRAPH DEFINITION API

If you are using other frameworks, such as TensorFlow, you can call our network builder API

ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});

IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);

From any framework

developer.nvidia.com/tensorrt
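To make the shape of this API concrete, here is a slightly fuller sketch that defines a trivial network entirely through the builder; the dimensions, names, and pooling parameters are illustrative assumptions, not values from the talk:

IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

// Declare the input tensor: name, precision, CHW dimensions (illustrative)
ITensor* in = network->addInput("input", DataType::kFLOAT, DimsCHW{3, 224, 224});

// A 2x2 max-pooling layer added directly through the API
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, DimsHW{2, 2});
pool->setStride(DimsHW{2, 2});

// Name and mark the output so TensorRT preserves it
pool->getOutput(0)->setName("output");
network->markOutput(*pool->getOutput(0));

builder->setMaxBatchSize(8);
builder->setMaxWorkspaceSize(64 << 20);
ICudaEngine* engine = builder->buildCudaEngine(*network);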

Page 12

CUSTOM LAYER API

Page 13

CUSTOM LAYER API

Allows users to express and provide implementations of novel layers

• TensorRT provides APIs and implementations for most common layers

• Use the Custom Layer API for infrequent or more innovative layers

• Register custom implementations via a callback mechanism

• Can be used in conjunction with reduced precision optimizations

[Diagram: Application → TensorRT (with the user-supplied Custom Layer plugged in) → CUDA Runtime]

Page 14

CUSTOM LAYER API

Specify and build the network:
getNbOutputs(), getOutputDimensions(), configure(), getWorkspaceSize()

Serialize the network:
getSerializationSize(), serialize()

Runtime:
initialize(), enqueue(), terminate()

Member functions of IPlugin objects
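As a sketch of how these members fit together, here is a skeleton plugin. The signatures follow the pattern of the TensorRT 2.x samplePlugin example; the bodies are placeholders for your own layer logic:

// Skeleton custom layer (sketch; bodies are placeholders)
class MyLayerPlugin : public nvinfer1::IPlugin
{
public:
    int getNbOutputs() const override { return 1; }

    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) override
    {
        return inputs[0];   // e.g. output has the same shape as the input
    }

    void configure(const nvinfer1::Dims* inputDims, int nbInputs,
                   const nvinfer1::Dims* outputDims, int nbOutputs, int maxBatchSize) override {}

    size_t getWorkspaceSize(int maxBatchSize) const override { return 0; }

    int initialize() override { return 0; }   // allocate per-run resources here
    void terminate() override {}              // ...and release them here

    int enqueue(int batchSize, const void* const* inputs, void** outputs,
                void* workspace, cudaStream_t stream) override
    {
        // launch your CUDA kernel(s) on `stream`
        return 0;
    }

    size_t getSerializationSize() override { return 0; }
    void serialize(void* buffer) override {}  // write the layer's parameters
};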

Page 15

INTEGRATING PLUGINS AND CUSTOM LAYERS

When building the network:

• Directly register using the API

• Plugin factory in the Caffe parser

Runtime:

• Runtime factory object

Two samples provided:

• Simple MNIST sample — replaces the fully connected layer with a GEMM

• Faster R-CNN — we provide a library with implementations of the needed reshape layers and ROIPooling
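A sketch of the factory route, following the dual-inheritance pattern of the TensorRT 2.x samplePlugin (the layer name is hypothetical, and MyLayerPlugin is the skeleton shown earlier):

// Factory used by the Caffe parser at build time and by the runtime deserializer
class PluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactory
{
public:
    bool isPlugin(const char* name) override
    {
        return !strcmp(name, "my_custom_layer");   // hypothetical layer name
    }

    // Called by the Caffe parser while building the network
    nvinfer1::IPlugin* createPlugin(const char* layerName,
                                    const nvinfer1::Weights* weights, int nbWeights) override
    {
        return new MyLayerPlugin();
    }

    // Called when deserializing a PLAN that contains the custom layer
    nvinfer1::IPlugin* createPlugin(const char* layerName,
                                    const void* serialData, size_t serialLength) override
    {
        return new MyLayerPlugin(/* restore state from serialData */);
    }
};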

Page 16

OPTIMIZE

Page 17

TensorRT Optimizations

• Fuse network layers

• Eliminate concatenation layers

• Apply reduced precision

• Specialized kernels

• Autotune for target platform

• Tune for given batch size

[Diagram: TRAINED NEURAL NETWORK → OPTIMIZED INFERENCE RUNTIME]

developer.nvidia.com/tensorrt

Page 18

BUILDING THE OPTIMIZED ENGINE

In General

IBuilder* builder = createInferBuilder(gLogger);

builder->setMaxBatchSize(maxBatchSize);

builder->setMaxWorkspaceSize(64 << 20);

auto engine = builder->buildCudaEngine(*network);

INT8

builder->setInt8Mode(true);

IInt8Calibrator* calibrator;   // supply a calibrator object — see the stub sketch below

builder->setInt8Calibrator(calibrator);

API calls

developer.nvidia.com/tensorrt

See the slides from session S7310 on INT8 quantization for details!
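For reference, a stub of the calibrator object the INT8 path needs. This is a sketch only: the method set follows the calibrator usage in TensorRT 2.x-era samples (the exact pure-virtual set can differ between releases), and the batch-feeding logic is left as an assumption:

// Stub INT8 calibrator: TensorRT calls getBatch() repeatedly to stream a
// representative calibration dataset through the FP32 network
class MyCalibrator : public IInt8Calibrator
{
public:
    int getBatchSize() const override { return 32; }   // illustrative

    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        // copy the next calibration batch into the device buffers in `bindings`;
        // return false once the dataset is exhausted
        return false;
    }

    const void* readCalibrationCache(size_t& length) override
    {
        length = 0;
        return nullptr;   // no cached table: force a fresh calibration
    }

    void writeCalibrationCache(const void* cache, size_t length) override
    {
        // optionally persist the calibration table for later builds
    }
};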

Page 19

EXECUTE THE NEURAL NETWORK

IExecutionContext* context = engine->createExecutionContext();

<index> = engine->getBindingIndex(<binding layer name>);

<malloc and cudaMalloc calls>   // allocate buffers for data moving in and out, using
                                // the index from the getBindingIndex() call above

cudaStream_t stream;

cudaStreamCreate(&stream);

cudaMemcpyAsync(<args>);   // copy input data to the GPU

context->enqueue(<args>);

cudaMemcpyAsync(<args>);   // copy output data to the host

cudaStreamSynchronize(stream);

Running inference using the API
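Filled in with concrete names and sizes — the binding names, tensor dimensions, and batch size below are illustrative assumptions, not values from the talk — the same sequence looks like this:

IExecutionContext* context = engine->createExecutionContext();

// Look up where the engine expects input and output (names are illustrative)
int inIdx  = engine->getBindingIndex("data");
int outIdx = engine->getBindingIndex("prob");

// Allocate device buffers for one batch (sizes are illustrative; needs <vector>)
const int batch = 1;
const size_t inBytes  = batch * 3 * 224 * 224 * sizeof(float);
const size_t outBytes = batch * 1000 * sizeof(float);
void* buffers[2];
cudaMalloc(&buffers[inIdx],  inBytes);
cudaMalloc(&buffers[outIdx], outBytes);

std::vector<float> input(batch * 3 * 224 * 224), output(batch * 1000);

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(buffers[inIdx], input.data(), inBytes, cudaMemcpyHostToDevice, stream);
context->enqueue(batch, buffers, stream, nullptr);
cudaMemcpyAsync(output.data(), buffers[outIdx], outBytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);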

Page 20

LOW LATENCY THROUGHPUT

[Chart: throughput (images/s, 0–2500) vs. latency (ms, 0–50), with points at batch sizes 1, 2, 4, 8, 16, 32, 64. Series: Caffe FP32 on CPU, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.]

ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.

Page 21

LOW LATENCY THROUGHPUT

[Same throughput-vs-latency chart, annotated "Same Batch Size": comparing the frameworks at an identical batch size.]

ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.

Page 22

LOW LATENCY THROUGHPUT

[Same throughput-vs-latency chart, annotated "More and faster": the TensorRT series sit higher (more images/s) and further left (lower latency).]

ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.

Page 23

DEPLOYING TensorRT AS A MICROSERVICE WITH GPU REST ENGINE (GRE)

Page 24

GPU REST ENGINE (GRE) SDK

Accelerated microservices for web and mobile

Supercomputer performance for hyperscale datacenters

Up to 50 teraflops per node, min ~250μs response time

Easy to develop new microservices

Open source, integrates with existing infrastructure

Easy to deploy & scale

Ready-to-run Dockerfile

[Diagram: HTTP requests (~250 μs) → GPU REST Engine → example microservices: Image Classification, Speech Recognition, Image Scaling, …]

developer.nvidia.com/gre

Page 25

Context Pool

[Diagram: a RESOURCE POOL of contexts spanning GPU1 and GPU2. Each GPU owns several Contexts; each incoming request borrows a Request-Scoped Context from the pool and returns it when the request completes.]

developer.nvidia.com/gre
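The pool in this diagram is straightforward to sketch in C++. The following is a minimal illustration of the pattern, not the actual GRE implementation (which lives in the repository above and differs in detail):

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>

// Minimal blocking pool of execution contexts (illustrative)
template <typename Context>
class ContextPool
{
public:
    void Push(std::unique_ptr<Context> ctx)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        pool_.push(std::move(ctx));
        cv_.notify_one();
    }

    std::unique_ptr<Context> Pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !pool_.empty(); });  // block until one is free
        auto ctx = std::move(pool_.front());
        pool_.pop();
        return ctx;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<Context>> pool_;
};

// RAII wrapper: borrows a context for one request, returns it on destruction
template <typename Context>
class ScopedContext
{
public:
    explicit ScopedContext(ContextPool<Context>& pool) : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(ctx_)); }
    Context* operator->() const { return ctx_.get(); }

private:
    ContextPool<Context>& pool_;
    std::unique_ptr<Context> ctx_;
};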

Page 26

CLASSIFICATION MICROSERVICE

[Diagram: Client → REST API → microservice layers. HTTP layer: Go, on the host CPU. App layer: C++, on the host CPU — classify() calls classifier_classify(). Device layer: func TensorRTInference running on the CUDA GPU. Each request executes inside a ScopedContext<>.]

developer.nvidia.com/gre

Page 27

CLASSIFICATION.CPP (1/2)

[Call stack: classify() → classifier_classify()]

constexpr static int kContextsPerDevice = 2;  // two contexts per GPU, to allow latency hiding

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
    try {
        int device_count;
        cudaError_t st = cudaGetDeviceCount(&device_count);
        ContextPool<ExecContext> pool;
        for (int dev = 0; dev < device_count; ++dev) {
            // one InferenceEngine per GPU
            std::shared_ptr<InferenceEngine> engine(new InferenceEngine(model_file, trained_file));
            for (int i = 0; i < kContextsPerDevice; ++i) {
                // one ExecContext per pool slot
                std::unique_ptr<ExecContext> context(new ExecContext(engine, mean_file,
                                                                     label_file, dev));
                pool.Push(std::move(context));
            }
        }
    } catch (...) { ... }
}

developer.nvidia.com/gre

Page 28

CLASSIFICATION.CPP (2/2)

[Call stack: classify() → classifier_classify()]

const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
{
    try {
        {
            // uses a scoped context borrowed from the pool
            ScopedContext<ExecContext> context(ctx->pool);
            // lower-level classify routine
            auto classifier = context->TensorRTClassifier();
            predictions = classifier->Classify(img);
        }
        /* Write the top N predictions in JSON format. */
    } catch (...) { ... }
}

developer.nvidia.com/gre

Page 29

CONCLUSION

Page 30

CONCLUSION

Inference powers an increasing number of features and capabilities

Cutting edge networks including custom layers can be deployed in TensorRT

TensorRT leverages GPU power to deliver throughput and low latency

GRE is a template to follow for creating accelerated microservices

developer.nvidia.com/gre

Page 31

WANT TO LEARN MORE?

developer.nvidia.com/tensorrt

developer.nvidia.com/gre — just updated with the code shown today!

devblogs.nvidia.com/parallelforall/

NVIDIA Jetson TX2 Delivers Twice …

Production Deep Learning …

www.nvidia.com/en-us/deep-learning-ai/education/

github.com/dusty-nv/jetson-inference

Here at GTC

S7310 8-bit Inference with TensorRT — Monday morning, but check the web for slides and recording

H7126 Deep Learning Inference with TensorRT — Wed 3 PM, Lower Level Pod B

L7123 Neural Network Deployment with DIGITS and TensorRT — Wed noon, LL21E

Resources to check out

developer.nvidia.com/gre

Page 32

[email protected]

THANKS

Page 33

RESOURCE SLIDES

Page 34

INT8 INFERENCE

Challenge:

• INT8 has significantly lower precision and dynamic range compared to FP32

• Requires "smart" quantization and calibration from FP32 to INT8

        Dynamic Range                  Min Positive Value
FP32    -3.4×10^38 ~ +3.4×10^38       1.4×10^-45
FP16    -65504 ~ +65504               5.96×10^-8
INT8    -128 ~ +127                   1

developer.nvidia.com/tensorrt

Page 35

QUANTIZATION OF WEIGHTS

Symmetric, linear quantization into [-127, 127]:

I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )

scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
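As a concrete illustration of the formula (a standalone sketch, not TensorRT code):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric linear quantization of one filter's weights into [-127, 127]
std::vector<int8_t> quantize_weights(const std::vector<float>& w, float& scaling_factor)
{
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    if (max_abs == 0.0f) max_abs = 1.0f;      // guard against an all-zero filter
    scaling_factor = 127.0f / max_abs;

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i)
        q[i] = static_cast<int8_t>(std::lround(scaling_factor * w[i]));  // round to nearest int
    return q;
}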

Page 36

QUANTIZATION OF ACTIVATIONS

I8_value = (value > threshold) ? threshold : scale * F32_value
(values beyond the threshold saturate)

How do you decide the optimal 'threshold'?

Activation range is unknown offline — it is input dependent

Calibration using a 'representative' dataset

[Diagram: activation histogram with candidate threshold positions]
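The clamp-then-scale step itself is simple once a threshold is chosen (picking the threshold well is the calibration problem above). A standalone sketch:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Saturating INT8 quantization of one activation, given a calibrated threshold
int8_t quantize_activation(float value, float threshold)
{
    float clamped = std::min(std::max(value, -threshold), threshold);  // saturate
    float scale = 127.0f / threshold;
    return static_cast<int8_t>(std::lround(scale * clamped));
}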

Page 37

TensorRT INT8 Workflow

[Diagram: FP32 Training Framework → FP32 NEURAL NETWORK → INT8 optimization using TensorRT (inputs: Calibration Dataset, Batch Size, Precision) → INT8 PLAN → INT8 runtime using TensorRT]

developer.nvidia.com/tensorrt

Page 38

8-BIT INFERENCE: Top-1 Accuracy

Network | FP32 Top-1 | INT8 Top-1 | Difference | Perf Gain
[table rows were not captured in this transcript]

developer.nvidia.com/tensorrt

Page 39

main.go

[Call chain: EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>]

func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
    // calls the C func
    C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
    io.WriteString(w, string(message[:]))
}

func main() {
    // set the API URL
    http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)
    // execute the server
    http.ListenAndServe(":8000", nil)
}

Page 40

benchmark.cpp (1/2)

[Call chain: EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>]

constexpr static int kContextsPerDevice = 4;   // 4 contexts per GPU

benchmark_ctx* benchmark_initialize()
{
    int device_count;
    cudaGetDeviceCount(&device_count);          // get the number of GPUs

    ContextPool<BenchmarkContext> pool;         // create the pool
    for (int dev = 0; dev < device_count; ++dev)
        for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
        }
}

Page 41

benchmark.cpp (2/2)

[Call chain: EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>]

void benchmark_execute(benchmark_ctx* ctx, char* message)
{
    // borrow a context from the pool for the duration of this request
    ScopedContext<BenchmarkContext> context(ctx->pool);
    cudaStream_t stream = context->CUDAStream();
    kernel_wrapper(stream, message);   // run the host-side wrapper
}

Page 42

kernel.cu

[Call chain: EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>]

// GPU code
__global__ void empty_kernel(char* device_message)
{
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break;
    }
}

// host-side wrapper
void kernel_wrapper(cudaStream_t stream, char* message)
{
    char* device_message;
    char* host_message;
    cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);

    empty_kernel<<<1, 1, 0, stream>>>(device_message);   // device call
    cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}

Page 43

TensorRT: Layer Types Supported

• Convolution: Currently only 2D convolutions

• Activation: ReLU, tanh and sigmoid

• Pooling: max and average

• Scale: similar to Caffe Power layer (shift+scale*x)^p

• ElementWise: sum, product or max of two tensors

• LRN: cross-channel only

• Fully-connected: with or without bias

• SoftMax: cross-channel only

• Deconvolution


Page 44

GRAPH OPTIMIZATION: Unoptimized network

[Diagram: an unoptimized Inception-style module. The input feeds parallel branches — 1x1, 3x3, and 5x5 convolutions, each followed by bias and ReLU, plus a max pool followed by a 1x1 convolution — whose outputs meet in a concat layer feeding the next input.]

Page 45

GRAPH OPTIMIZATION: Vertical fusion

[Diagram: the same module after vertical fusion. Each convolution + bias + ReLU chain collapses into a single fused CBR kernel (1x1 CBR, 3x3 CBR, 5x5 CBR), still joined by concat, with the max pool unchanged.]

Page 46

GRAPH OPTIMIZATION: Horizontal fusion

[Diagram: after horizontal fusion. The parallel 1x1 CBR layers that read the same input are merged into a single wider 1x1 CBR kernel.]

Page 47

GRAPH OPTIMIZATION: Concat elision

[Diagram: after concat elision. The concat layer is removed; each remaining CBR branch writes directly into the buffer that forms the next input.]

Page 48

IDP.4A – 8 BIT INSTRUCTION

[Diagram: four pairs of 8-bit values are multiplied and their products accumulated into a 32-bit integer: i32 += (i8 × i8) + (i8 × i8) + (i8 × i8) + (i8 × i8)]
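On supporting GPUs (compute capability 6.1, e.g. P40), this instruction is exposed in CUDA C++ as the __dp4a intrinsic. A minimal sketch (the kernel and its names are illustrative):

// Dot product of n four-byte groups of INT8 values, accumulated in INT32.
// Each int argument packs four i8 values; __dp4a does 4 multiplies + add.
__global__ void dot_int8(const int* a, const int* b, int n, int* result)
{
    int acc = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc = __dp4a(a[i], b[i], acc);
    atomicAdd(result, acc);
}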

Page 49

SMALLER AND FASTER

[Charts: Performance (images/s scaled to FP32, roughly 0–3.5x) and Memory Usage (% scaled to FP32) for FP32, FP16 on P100, and INT8 on P40. ResNet-50 model, batch size = 128, TensorRT 2.1 RC pre-release.]

developer.nvidia.com/tensorrt

Page 50

INT8 Precision — New in TensorRT

ACCURACY · EFFICIENCY · PERFORMANCE

[Charts:
• Performance — up to 3x more images/sec with INT8 precision: images/second (0–7,000) vs. batch size (2, 4, 128), FP32 vs. INT8.
• Efficiency — deploy 2x larger models with INT8: memory (MB, 0–1,400) vs. batch size (2, 4, 128), FP32 vs. INT8.
• Accuracy — deliver full accuracy with INT8 precision: Top-1 and Top-5 accuracy (%), FP32 vs. INT8.
GoogLeNet, FP32 vs. INT8 precision + TensorRT on Tesla P40 GPU; 2-socket Haswell E5-2698 v3 @ 2.3 GHz with HT off.]

Page 51

THROUGHPUT

[Chart: images/s (0–2500) vs. batch size (1, 2, 4, 8, 16, 32, 64, 128) for Caffe FP32 on CPU, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.]

ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.

Page 52

LATENCY

[Chart: latency (ms, log scale 1–10,000) vs. batch size (1, 2, 4, 8, 16, 32, 64, 128) for Caffe FP32 on CPU, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.]

ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.

Page 53

NVIDIA TensorRT: High-performance deep learning inference for production deployment

developer.nvidia.com/tensorrt

High performance neural network inference engine for production deployment

Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms

Deliver high-performance, low-latency inference demanded by real-time services

Deploy faster, more responsive and memory efficient deep learning applications with INT8 and FP16 optimized precision support

[Chart: Up to 36x more images/sec — GoogLeNet images/second (0–7,000) at batch sizes 2, 8, 128 for CPU-only, Tesla P40 + TensorRT (FP32), and Tesla P40 + TensorRT (INT8).
CPU: 1-socket E5-2690 v4 @ 2.6 GHz, HT on. GPU host: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, one P40 card in the box.]