May 2017 – Chris Gottbrath
S7458 DEPLOYING UNIQUE DL NETWORKS AS MICRO-SERVICES WITH TENSORRT, USER EXTENSIBLE LAYERS, AND GPU REST ENGINE
AGENDA
Inference
TensorRT
Custom Layer API
GPU REST Engine
Conclusion
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFY: SONG RECOMMENDATIONS
NETFLIX: VIDEO RECOMMENDATIONS
YELP: SELECTING COVER PHOTOS
DL INFERENCE POWERED API
Step-by-step
1. Gather Data
2. Train Your Model
3. Load Your Trained Model into TensorRT
4. Extend TensorRT with your Custom Layer(s) (if necessary)
5. Optimize using TensorRT
6. Deploy using GRE
WHAT IS TensorRT
NVIDIA TensorRT
High-performance deep learning inference for production deployment
High performance neural network inference optimizer and runtime engine for production deployment
Maximize inference throughput for latency-critical services
Diagram labels: Trained Neural Network; TensorRT Optimizer; TensorRT Runtime Engine
Chart: Up To 30x More Images/sec vs. CPU-Only Inference. Y axis: images/second (0 to 1,600). Bar labels as they appear on the slide: Caffe on 28 core E5-2690v4 x2 (b=121.3 ms); Caffe on K80 (b=125 ms); TensorRT on K80 (b=110 ms); TensorRT on P100 FP32 (b=25.4 ms); TensorRT on P100 FP16 (b=97 ms)
GoogLeNet, Tesla P100 + TensorRT (FP16), Tesla K80 + TensorRT (FP32), CPU-Only + Caffe (FP32)
CPU: 1 Socket Broadwell E5-2690 v4 @ 2.6 GHz with HT off
developer.nvidia.com/tensorrt
TensorRT Development Workflow
Diagram labels: Training Framework; NEURAL NETWORK; Optimize using TensorRT (inputs: Batch Size, Precision); PLAN; Validate using TensorRT; Serialize to disk
developer.nvidia.com/tensorrt
TensorRT Production Workflow
Diagram labels: Serialized PLAN; Infer using TensorRT
developer.nvidia.com/tensorrt
IMPORT MODEL
TO IMPORT A TRAINED MODEL TO TensorRT
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
CaffeParser parser;
auto blob_name_to_tensor = parser.parse(<network definition>,<weights>,*network,<datatype>);
network->markOutput(*blob_name_to_tensor->find(<output layer name>));
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);
From Caffe: this assumes you have a Caffemodel file
developer.nvidia.com/tensorrt
Future: We are looking at a streamlined graph input for TensorFlow and other frameworks too!
IMPORTING USING THE GRAPH DEFINITION API
From any framework
If using other frameworks such as TensorFlow, you can call our network builder API:
ITensor* in = network->addInput("input", DataType::kFloat, Dims3{…});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);
developer.nvidia.com/tensorrt
CUSTOM LAYER API
CUSTOM LAYER API
Allows users to express and provide implementations of novel layers
• TensorRT provides APIs and implementations for most common layers
• Use the Custom Layer API for infrequent or more innovative layers
• Register custom implementations via a callback mechanism
• Can be used in conjunction with reduced precision optimizations
Diagram labels: Application; TensorRT; Custom Layer; CUDA Runtime
CUSTOM LAYER API
Member functions of IPlugin objects
Specify and build the network
• getNbOutputs()
• getOutputDimensions()
• configure()
• getWorkspaceSize()
Serialize the network
• getSerializationSize()
• serialize()
Runtime
• initialize()
• enqueue()
• terminate()
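The three phases above can be sketched with a stand-in interface. This is only an illustration: `PluginSketch`, `Dims3Like`, `LeakyReluSketch` and `runSketch` are hypothetical names, the signatures are simplified and are not the real TensorRT `IPlugin` signatures, and the work runs on the CPU, whereas a real plugin would launch GPU work from `enqueue()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-ins; only the method names follow the slide.
struct Dims3Like { int c, h, w; };

class PluginSketch {
public:
    virtual ~PluginSketch() = default;
    // Phase 1: specify and build the network
    virtual int getNbOutputs() const = 0;
    virtual Dims3Like getOutputDimensions(int index, const Dims3Like& input) const = 0;
    virtual void configure(const Dims3Like& input, int maxBatchSize) = 0;
    virtual std::size_t getWorkspaceSize(int maxBatchSize) const = 0;
    // Phase 2: serialize the network
    virtual std::size_t getSerializationSize() const = 0;
    virtual void serialize(void* buffer) const = 0;
    // Phase 3: runtime
    virtual int initialize() = 0;
    virtual int enqueue(int batchSize, const float* input, float* output) = 0;
    virtual void terminate() = 0;
};

// Toy custom layer: leaky ReLU with a configurable negative slope.
class LeakyReluSketch : public PluginSketch {
public:
    explicit LeakyReluSketch(float slope) : slope_(slope) {}
    int getNbOutputs() const override { return 1; }
    Dims3Like getOutputDimensions(int, const Dims3Like& input) const override { return input; }
    void configure(const Dims3Like& input, int) override { volume_ = input.c * input.h * input.w; }
    std::size_t getWorkspaceSize(int) const override { return 0; }  // no scratch memory needed
    std::size_t getSerializationSize() const override { return sizeof(slope_) + sizeof(volume_); }
    void serialize(void* buffer) const override {
        char* p = static_cast<char*>(buffer);
        std::memcpy(p, &slope_, sizeof(slope_));
        std::memcpy(p + sizeof(slope_), &volume_, sizeof(volume_));
    }
    int initialize() override { return 0; }
    int enqueue(int batchSize, const float* input, float* output) override {
        for (int i = 0; i < batchSize * volume_; ++i)
            output[i] = input[i] > 0.0f ? input[i] : slope_ * input[i];
        return 0;
    }
    void terminate() override {}
private:
    float slope_;
    int volume_ = 0;
};

// Drives the build and runtime calls in order for a batch of one.
std::vector<float> runSketch(PluginSketch& plugin, const Dims3Like& dims,
                             const std::vector<float>& input) {
    assert(input.size() == static_cast<std::size_t>(dims.c * dims.h * dims.w));
    plugin.configure(dims, 1);
    plugin.initialize();
    std::vector<float> output(input.size());
    plugin.enqueue(1, input.data(), output.data());
    plugin.terminate();
    return output;
}
```

The build-time methods let the optimizer size buffers before any data flows, which is why they are separate from `enqueue()`.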
INTEGRATING PLUGINS AND CUSTOM LAYERS
When building the network
• Directly register using the API
• Plugin factory in the Caffe parser
Runtime
• Runtime factory object
Two samples provided
• Simple MNIST sample: replaces the fully connected layer with a GEMM
• Faster R-CNN: we provide a library with implementations of the needed reshape layers and ROIPooling
OPTIMIZE
TensorRT Optimizations
• Fuse network layers
• Eliminate concatenation layers
• Apply reduced precision
• Specialized kernels
• Autotune for target platform
• Tune for given batch size
Diagram labels: TRAINED NEURAL NETWORK; OPTIMIZED INFERENCE RUNTIME
developer.nvidia.com/tensorrt
BUILDING THE OPTIMIZED ENGINE
API calls
In General
IBuilder* builder = createInferBuilder(gLogger);
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(64 << 20); // 64 MiB of workspace
auto engine = builder->buildCudaEngine(*network);
INT8 (set these before calling buildCudaEngine)
builder->setInt8Mode(true);
IInt8Calibrator* calibrator; // must point at your IInt8Calibrator implementation
builder->setInt8Calibrator(calibrator);
developer.nvidia.com/tensorrt
See the slides from session S7310 on INT8 quantization for details!
EXECUTE THE NEURAL NETWORK
Running inference using the API
IExecutionContext* context = engine->createExecutionContext();
<index> = engine->getBindingIndex(<binding layer name>);
// Allocate buffers for data moving in and out,
// using the index from the getBindingIndex() call above
<malloc and cudaMalloc calls>
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(<args>); // Copy input data to the GPU
context->enqueue(<args>);
cudaMemcpyAsync(<args>); // Copy output data to the host
cudaStreamSynchronize(stream);
LOW LATENCY THROUGHPUT
Chart: throughput (images/s, 0 to 2500) vs. latency (ms, 0 to 50), with points labeled by batch size (1, 2, 4, 8, 16, 32, 64). Series: Caffe FP32 on CPU; TensorFlow FP32 on P100; TensorRT FP32 on P100; TensorRT FP16 on P100.
Resnet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690v4 with 14 cores.
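The two axes of this chart are linked by simple arithmetic when one batch is processed at a time: throughput is the batch size divided by the latency of that batch. A minimal sketch, assuming a single batch in flight; the function name is hypothetical and the numbers used with it are illustrative, not read from the chart.

```cpp
#include <cassert>

// Images per second when one batch of `batchSize` images completes in `latencyMs`.
double throughputImagesPerSec(int batchSize, double latencyMs) {
    assert(batchSize > 0 && latencyMs > 0.0);
    return batchSize * 1000.0 / latencyMs;
}
```

For example, a batch of 8 that completes in 4 ms corresponds to 2000 images/s, so a larger batch only helps if latency grows more slowly than the batch size.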
LOW LATENCY THROUGHPUT
Same chart as the previous slide, with the annotation: Same Batch Size
LOW LATENCY THROUGHPUT
Same chart as the previous slide, with the annotation: More and faster
DEPLOYING TensorRT AS A MICROSERVICE WITH GPU REST ENGINE (GRE)
GPU REST ENGINE (GRE) SDK
Accelerated microservices for web and mobile
Supercomputer performance for hyperscale datacenters
Up to 50 teraflops per node, min ~250μs response time
Easy to develop new microservices
Open source, integrates with existing infrastructure
Easy to deploy & scale
Ready-to-run Dockerfile
Diagram labels: HTTP (~250μs); GPU REST Engine; Image Classification; Speech Recognition; …; Image Scaling
developer.nvidia.com/gre
RESOURCE POOL
Diagram labels: Context Pool; GPU1; GPU2; Context (×4); Request ScopedContext (×9)
developer.nvidia.com/gre
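The pool-plus-scoped-handle pattern in this diagram can be sketched in plain C++. This is not the GRE implementation: `PoolSketch`, `ScopedSketch` and `FakeContext` are hypothetical names, and the sketch assumes a blocking FIFO pool in which a request waits until a context is free and returns it automatically when the handle goes out of scope.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Thread-safe pool of reusable contexts (hypothetical stand-in).
template <typename Context>
class PoolSketch {
public:
    void Push(std::unique_ptr<Context> ctx) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            contexts_.push(std::move(ctx));
        }
        cv_.notify_one();  // wake one waiting request
    }
    // Blocks until a context is available, then hands over ownership.
    std::unique_ptr<Context> Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !contexts_.empty(); });
        std::unique_ptr<Context> ctx = std::move(contexts_.front());
        contexts_.pop();
        return ctx;
    }
    std::size_t Available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return contexts_.size();
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<Context>> contexts_;
};

// RAII handle: borrows a context for the lifetime of one request.
template <typename Context>
class ScopedSketch {
public:
    explicit ScopedSketch(PoolSketch<Context>& pool) : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedSketch() { pool_.Push(std::move(ctx_)); }  // return the context to the pool
    ScopedSketch(const ScopedSketch&) = delete;
    ScopedSketch& operator=(const ScopedSketch&) = delete;
    Context* operator->() const { return ctx_.get(); }
private:
    PoolSketch<Context>& pool_;
    std::unique_ptr<Context> ctx_;
};

// Placeholder for a per-GPU execution context.
struct FakeContext { int device; };
```

Because the destructor returns the context, a request handler cannot forget to release it, even when it exits early through an exception.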
CLASSIFICATION MICROSERVICE
Diagram labels: Client; REST API; Microservice. Layers: HTTP layer; App layer; Device-layer. Languages: Go; C++. Hardware: Host CPU; Host CPU; CUDA device GPU. Calls: func TensorRTInference; classifier_classify(); classify(); ScopedContext<>
developer.nvidia.com/gre
CLASSIFICATION.CPP (1/2)
Call chain: func classify → classifier_classify() → classify()
constexpr static int kContextsPerDevice = 2; // to allow latency hiding

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
    try {
        int device_count;
        cudaError_t st = cudaGetDeviceCount(&device_count);
        ContextPool<ExecContext> pool;
        for (int dev = 0; dev < device_count; ++dev) {
            // One engine per GPU
            std::shared_ptr<InferenceEngine> engine(new InferenceEngine(model_file, trained_file));
            for (int i = 0; i < kContextsPerDevice; ++i) {
                // One ExecContext per context
                std::unique_ptr<ExecContext> context(new ExecContext(engine, mean_file,
                                                                     label_file, dev));
                pool.Push(std::move(context));
            }
        }
    } catch (...) { ... }
}
developer.nvidia.com/gre
CLASSIFICATION.CPP (2/2)
Call chain: func classify → classifier_classify() → classify()
const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length)
{
    try {
        {
            // Uses a scoped context
            ScopedContext<ExecContext> context(ctx->pool);
            auto classifier = context->TensorRTClassifier();
            // Lower level classify routine
            predictions = classifier->Classify(img);
        }
        /* Write the top N predictions in JSON format. */
    } catch (...) { ... }
}
developer.nvidia.com/gre
CONCLUSION
CONCLUSION
Inference powers an increasing number of features and capabilities
Cutting edge networks including custom layers can be deployed in TensorRT
TensorRT leverages GPU power to deliver throughput and low latency
GRE is a template to follow for creating accelerated microservices
developer.nvidia.com/gre
WANT TO LEARN MORE?
Resources to check out
developer.nvidia.com/tensorrt
developer.nvidia.com/gre (just updated with the code shown today!)
devblogs.nvidia.com/parallelforall/
• NVIDIA Jetson TX2 Delivers Twice …
• Production Deep Learning …
www.nvidia.com/en-us/deep-learning-ai/education/
github.com/dusty-nv/jetson-inference
Here at GTC
• S7310 8-bit Inference with TensorRT: Monday morning, but check web for slides and recording
• H7126 Deep Learning Inference with TensorRT: Wed 3PM, Lower Level Pod B
• L7123 Neural Network Deployment with DIGITS and TensorRT: Wed Noon, LL21E
RESOURCE SLIDES
INT8 INFERENCE
Challenge
• Main challenge
• INT8 has significantly lower precision and dynamic range compared to FP32
• Requires “smart” quantization and calibration from FP32 to INT8

| | Dynamic Range | Min Pos Value |
|---|---|---|
| FP32 | -3.4 × 10^38 ~ +3.4 × 10^38 | 1.4 × 10^-45 |
| FP16 | -65504 ~ +65504 | 5.96 × 10^-8 |
| INT8 | -128 ~ +127 | 1 |
developer.nvidia.com/tensorrt
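The FP16 row of the table follows from the IEEE 754 half-precision layout (10 fraction bits, maximum exponent 15, smallest subnormal 2^-24). A small sketch that recomputes those values; the function names are hypothetical, and the FP32 and INT8 rows can be checked against `std::numeric_limits`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Largest finite FP16 value: (2 - 2^-10) * 2^15 = 65504.
double fp16Max() { return (2.0 - std::ldexp(1.0, -10)) * std::ldexp(1.0, 15); }

// Smallest positive (subnormal) FP16 value: 2^-24, roughly 5.96e-8.
double fp16MinPositive() { return std::ldexp(1.0, -24); }

// Smallest positive (subnormal) FP32 value, roughly 1.4e-45.
double fp32MinPositive() {
    return static_cast<double>(std::numeric_limits<float>::denorm_min());
}
```

INT8 has no fractional values at all, so its smallest positive step is 1, which is why the scale factor chosen during quantization matters so much.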
QUANTIZATION OF WEIGHTS
Symmetric, Linear Quantization
I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )
scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
Number line: [-127, 127] (-127, -126, -125, …, 125, 126, 127)
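The two formulas above translate directly into code. A minimal sketch, assuming one scale factor per filter and at least one non-zero weight; the function names are hypothetical, and `std::lround` (round half away from zero) stands in for the slide's `Round_to_nearest_int`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// scaling_factor = 127 / max(|w|) over all weights of one filter.
float scalingFactor(const std::vector<float>& weights) {
    float maxAbs = 0.0f;
    for (float w : weights) maxAbs = std::max(maxAbs, std::fabs(w));
    assert(maxAbs > 0.0f);  // assumption: the filter is not all zeros
    return 127.0f / maxAbs;
}

// I8_weight = round(scaling_factor * F32_weight), always within [-127, 127].
std::vector<std::int8_t> quantizeWeights(const std::vector<float>& weights) {
    const float scale = scalingFactor(weights);
    std::vector<std::int8_t> quantized;
    quantized.reserve(weights.size());
    for (float w : weights)
        quantized.push_back(static_cast<std::int8_t>(std::lround(scale * w)));
    return quantized;
}
```

With weights {-2.0, 0.5, 1.0, 2.0} the scale is 63.5 and the quantized values are {-127, 32, 64, 127}; the largest-magnitude weight always maps to ±127.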
QUANTIZATION OF ACTIVATIONS
I8_value = (value > threshold) ? threshold : scale * F32_value
How do you decide optimal ‘threshold’?
Activation range is unknown offline, input dependent
Calibration using ‘representative’ dataset
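Saturating quantization of an activation can be sketched as follows. The assumptions here go beyond the slide: the threshold is applied symmetrically on both sides, the scale is taken as 127 / threshold, and the function name is hypothetical. How the threshold itself is chosen is the calibration step, which this sketch does not cover.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Clamp to [-threshold, +threshold], then scale so that the threshold maps to 127.
std::int8_t quantizeActivation(float value, float threshold) {
    assert(threshold > 0.0f);
    const float scale = 127.0f / threshold;  // assumed relation between scale and threshold
    const float clamped = std::max(-threshold, std::min(value, threshold));
    return static_cast<std::int8_t>(std::lround(scale * clamped));
}
```

A smaller threshold gives finer resolution for typical values but saturates more outliers, which is the trade-off that calibration on a representative dataset has to settle.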
TensorRT INT8 Workflow
Diagram labels: FP32 Training Framework; FP32 NEURAL NETWORK; INT8 OPTIMIZATION USING TensorRT (inputs: Calibration Dataset, Batch Size, Precision); INT8 PLAN; INT8 RUNTIME USING TensorRT
developer.nvidia.com/tensorrt
8-BIT INFERENCE: Top-1 Accuracy
Table columns: Network | FP32 Top1 | INT8 Top1 | Difference | Perf Gain
developer.nvidia.com/tensorrt
main.go
Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>
func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
    // Calls the C func
    C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
    io.WriteString(w, string(message[:]))
}

func main() {
    // Set API URL
    http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)
    // Execute server
    http.ListenAndServe(":8000", nil)
}
benchmark.cpp (1/2)
Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>
constexpr static int kContextsPerDevice = 4; // 4 per GPU

benchmark_ctx* benchmark_initialize() {
    int device_count;
    cudaGetDeviceCount(&device_count);   // Get # GPUs
    ContextPool<BenchmarkContext> pool;  // Create pool
    for (int dev = 0; dev < device_count; ++dev) {
        for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
        }
    }
    // Construction and return of the benchmark_ctx are not shown on the slide
}
benchmark.cpp (2/2)
Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>
void benchmark_execute(benchmark_ctx* ctx, char* message) {
    // Scoped context
    ScopedContext<BenchmarkContext> context(ctx->pool);
    cudaStream_t stream = context->CUDAStream();
    // Run the wrapper
    kernel_wrapper(stream, message);
}
kernel.cu
Call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>
// GPU code
__global__ void empty_kernel(char* device_message) {
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break;
    }
}

// Host side wrapper
void kernel_wrapper(cudaStream_t stream, char* message) {
    cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);
    empty_kernel<<<1, 1, 0, stream>>>(device_message); // Device call
    cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}
TensorRT: Layer Types Supported
• Convolution: currently only 2D convolutions
• Activation: ReLU, tanh and sigmoid
• Pooling: max and average
• Scale: similar to Caffe Power layer (shift+scale*x)^p
• ElementWise: sum, product or max of two tensors
• LRN: cross-channel only
• Fully-connected: with or without bias
• SoftMax: cross-channel only
• Deconvolution
GRAPH OPTIMIZATION: Unoptimized network
Diagram nodes: input; max pool; 1x1 conv. with bias and relu (×4); 3x3 conv. with bias and relu; 5x5 conv. with bias and relu; concat (×2); next input
GRAPH OPTIMIZATION: Vertical fusion
Diagram nodes: input; max pool; 1x1 CBR (×4); 3x3 CBR; 5x5 CBR; concat (×2); next input
(CBR: convolution, bias and ReLU fused into a single layer)
GRAPH OPTIMIZATION: Horizontal fusion
Diagram nodes: input; max pool; 1x1 CBR (×2); 3x3 CBR; 5x5 CBR; concat (×2); next input
GRAPH OPTIMIZATION: Concat elision
Diagram nodes: input; max pool; 1x1 CBR (×2); 3x3 CBR; 5x5 CBR; next input
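The idea behind vertical fusion can be illustrated on the CPU: three separate passes over a tensor (convolution, bias, ReLU) are replaced by one pass that produces the same values while touching memory once. This is only an illustration with a single-channel 1x1 convolution reduced to a multiply; the function names are hypothetical and this is not how TensorRT implements its fused kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: one full pass over the data per layer.
std::vector<float> convBiasReluUnfused(const std::vector<float>& x, float weight, float bias) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = weight * x[i];          // 1x1 conv
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = y[i] + bias;            // bias
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = std::max(y[i], 0.0f);   // relu
    return y;
}

// Fused "CBR": the same arithmetic in a single pass.
std::vector<float> convBiasReluFused(const std::vector<float>& x, float weight, float bias) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = std::max(weight * x[i] + bias, 0.0f);
    return y;
}
```

On a GPU the saving is larger than on a CPU, because each unfused layer is a separate kernel launch plus a round trip through device memory.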
IDP.4A – 8 BIT INSTRUCTION
Diagram: four i8 × i8 products are summed and added to an i32 accumulator, giving an i32 result
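The arithmetic of this instruction can be emulated in portable C++: four signed 8-bit products are accumulated into a 32-bit integer. The function name below is hypothetical; in CUDA the same operation is exposed through the `__dp4a` intrinsic on GPUs of compute capability 6.1 and later, operating on four bytes packed into one 32-bit word.

```cpp
#include <cassert>
#include <cstdint>

// acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3], all in 32-bit arithmetic.
std::int32_t dp4aSketch(const std::int8_t a[4], const std::int8_t b[4], std::int32_t acc) {
    for (int i = 0; i < 4; ++i)
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    return acc;
}
```

Accumulating in 32 bits is what keeps long INT8 dot products from overflowing: a single product can reach 16129 in magnitude, far beyond the 8-bit range.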
SMALLER AND FASTER
Chart: Performance, images/s scaled to FP32 (axis 0 to 3.5). Bars: FP32; FP16 on P100; INT8 on P40.
Chart: Memory Usage, % scaled to FP32 (axis 0 to 120). Bars: FP32; FP16 on P100; INT8 on P40.
ResNet50 Model, Batch Size = 128, TensorRT 2.1 RC prerelease
developer.nvidia.com/tensorrt
INT8 precision: New in TensorRT
PERFORMANCE: Up To 3x More Images/sec with INT8 Precision. Chart: images/second (0 to 7000) vs. batch size (2, 4, 128), FP32 vs. INT8.
EFFICIENCY: Deploy 2x Larger Models with INT8 Precision. Chart: memory (MB, 0 to 1400) vs. batch size (2, 4, 128), FP32 vs. INT8.
ACCURACY: Deliver full accuracy with INT8 precision. Chart: % accuracy (0% to 100%), Top 1 and Top 5, FP32 vs. INT8.
GoogLenet, FP32 vs INT8 precision + TensorRT on Tesla P40 GPU, 2 Socket Haswell E5-2698 v3 @ 2.3 GHz with HT off
THROUGHPUT
Chart: images/s (0 to 2500) vs. batch size (1, 2, 4, 8, 16, 32, 64, 128). Series: Caffe FP32 on CPU; TensorFlow FP32 on P100; TensorRT FP32 on P100; TensorRT FP16 on P100.
Resnet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690v4 with 14 cores.
LATENCY
Chart: latency (ms, log scale from 1 to 10000) vs. batch size (1, 2, 4, 8, 16, 32, 64, 128). Series: Caffe FP32 on CPU; TensorFlow FP32 on P100; TensorRT FP32 on P100; TensorRT FP16 on P100.
Resnet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690v4 with 14 cores.
NVIDIA TensorRT
High-performance deep learning inference for production deployment
High performance neural network inference engine for production deployment
Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms
Deliver high-performance, low-latency inference demanded by real-time services
Deploy faster, more responsive and memory efficient deep learning applications with INT8 and FP16 optimized precision support
Chart: Up to 36x More Image/sec. Images/second (0 to 7,000) vs. batch size (2, 8, 128). Series: CPU-Only; Tesla P40 + TensorRT (FP32); Tesla P40 + TensorRT (INT8).
GoogLenet, CPU-only vs Tesla P40 + TensorRT
CPU: 1 socket E5-2690 v4 @2.6 GHz, HT-on
GPU: 2 socket E5-2698 v3 @2.3 GHz, HT off, 1 P40 card in the box
developer.nvidia.com/tensorrt