Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017


Transcript of Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

Page 1: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

HIGH PERFORMANCE TENSORFLOW IN PRODUCTION WITH GPUS
NVIDIA GPU TECH CONFERENCE
WASHINGTON DC, NOV 2017
CHRIS FREGLY, [email protected]

Page 2: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

INTRODUCTIONS: ME

§ Chris Fregly, Founder & Engineer @ PipelineAI

§ Formerly Netflix and Databricks

§ Advanced Spark and TensorFlow Meetup

Please Join Our 50,000+ Members Globally!

Contact [email protected]

@cfregly

* San Francisco
* Chicago
* Austin
* Washington DC
* Dusseldorf
* London

Page 3: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

INTRODUCTIONS: YOU

§ Software Engineer or Data {Scientist, Engineer, Analyst}

§ Interested in Optimizing + Deploying TF Models to Production

§ Working knowledge of TensorFlow is nice to have, but not required

Page 4: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

CONTENT BREAKDOWN

50% Training Optimizations (GPUs, Pipeline, XLA+JIT)
50% Prediction Optimizations (XLA+AOT, TF Serving)

Why Heavy Focus on Model Prediction vs. Model Training?

50 Data Scientists <<< 120 Million App Users

Training: Boring & Batch

Prediction: Exciting & Real-Time!!

Page 5: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

100% OPEN SOURCE CODE

§ https://github.com/PipelineAI/pipeline/

§ Please star this GitHub Repo!

§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml

Page 6: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

HANDS-ON EXERCISES

§ Combo of Jupyter Notebooks and Command Line
§ Command Line through Jupyter Terminal

§ Some Exercises Based on Experimental Features

You May See Errors. Stay Calm. You Will Be OK!!

Page 7: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

OTHER TENSORFLOW TALKS AT GTC!

4:30-4:55pm Today in Hemisphere B.

Page 8: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

Part 1: TensorFlow Model Training

Part 2: TensorFlow Model Serving

Page 9: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PIPELINE.AI OVERVIEW

Page 10: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

§ Package, Deploy, and Tune Both Model + Runtime
§ Experiment Safely in Production
§ Compare Models + Runtimes Both Offline + Online
§ Shift Traffic to Winning Model or Cheaper Cloud!

Page 11: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PACKAGE MODEL + RUNTIME AS ONE

§ Package Model + Runtime into Immutable Docker Image

§ Same Package: Local, Dev, and Prod

§ No Dependency Surprises in Production

§ Deploy (and Tune) Model + Runtime as a Single Unit

Page 12: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

OPTIMIZE MODEL + RUNTIME AS ONE

§ Tune BOTH Model Params + Runtime Configs Together

§ Generate Native CPU + GPU Code

§ Quantize Model Weights + Activations

§ Swap Runtimes: TensorFlow Serving, Nvidia TensorRT, …

Page 13: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

NVIDIA TENSOR-RT RUNTIME

§ Post-Training Model Optimizations
§ Similar to TF Graph Transform Tool

§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving

§ PipelineAI Supports TensorRT!

Page 14: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

§ Package, Deploy, and Tune Both Model + Runtime
§ Experiment Safely in Production
§ Compare Models + Runtimes Both Offline + Online
§ Shift Traffic to Winning Model or Cheaper Cloud!

Page 15: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

EXPERIMENT SAFELY IN PRODUCTION

§ Setup Experiments Directly from Jupyter Notebooks

§ Deploy to 1% Prod Traffic

§ Deploy in Traffic-Shadow Mode

§ Tear-Down or Rollback Experiments Quickly

Page 16: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

§ Package, Deploy, and Tune Both Model + Runtime
§ Experiment Safely in Production
§ Compare Models + Runtimes Both Offline + Online
§ Shift Traffic to Winning Model or Cheaper Cloud!

Page 17: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

COMPARE MODELS OFFLINE + ONLINE

§ Offline Metrics
§ Training Accuracy
§ Validation Accuracy

§ Online Real-Time Metrics
§ Prediction Precision
§ Latency + Throughput

Page 18: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PREDICTION PROFILING + TUNING

§ Pinpoint Performance Bottlenecks

§ Fine-Grained Prediction Metrics

§ 3 Logic Steps in a Prediction (sketch below):
1. transform_request()
2. predict()
3. transform_response()
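A minimal sketch of those three steps as Python functions: the function names come from the slide, while the bodies and the `session`, `x`, and `y_pred` objects are hypothetical:

import numpy as np

def transform_request(raw):
    # e.g. JSON request -> numpy tensor (hypothetical request shape)
    return np.asarray(raw["instances"], dtype=np.float32)

def predict(tensor):
    # forward pass only (session, x, y_pred assumed defined elsewhere)
    return session.run(y_pred, feed_dict={x: tensor})

def transform_response(scores):
    # numpy tensor -> JSON response
    return {"predictions": scores.tolist()}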

Page 19: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

VIEW LIVE PREDICTION STREAMS

§ Visually Compare Real-Time Predictions

Page 20: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

§ Package, Deploy, and Tune Both Model + Runtime
§ Experiment Safely in Production
§ Compare Models + Runtimes Both Offline + Online
§ Shift Traffic to Winning Model or Cheaper Cloud!

Page 21: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SHIFT TRAFFIC TO MAXIMIZE REVENUE

§ Shift Traffic to Winning Model using AI Bandit Algorithms

Page 22: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SHIFT TRAFFIC TO MINIMIZE COST

§ Real-Time Cost Per Prediction

§ Across Clouds + On-Premise

§ Explore/Exploit Bandits

Page 23: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

CONTINUOUS MODEL TRAINING

§ Identify and Fix Borderline Predictions (50/50 Confidence)

§ Fix Along Class Boundaries

§ Retrain on New Labeled Data

§ Enables Crowd Sourcing

§ Game-ify Labeling Process

Page 24: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

Part 1: TensorFlow Model Training

Part 2: TensorFlow Model Serving

Page 25: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 1: TensorFlow Model Training

§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler

Page 26: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

EVERYBODY GETS A GPU!

Page 27: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SETUP ENVIRONMENT

§ Step 1: Browse to the following:
http://allocator.community.pipeline.ai/allocate

§ Step 2: Browse to the following:
http://<ip-address>

§ Step 3: Browse around. I will provide a Jupyter Username/Password soon.

Need Help? Use the Chat!

Page 28: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

VERIFY SETUP

http://<ip-address>

Any username, any password!

Page 29: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S EXPLORE OUR ENVIRONMENT

§ Navigate to the following notebook:

01_Explore_Environment

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 30: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PULSE CHECK

Page 31: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BREAK

§ Please star this GitHub Repo!

§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml

Need Help? Use the Chat!

Page 32: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SETTING UP TENSORFLOW WITH GPUS

§ Very Painful!

§ Especially inside Docker
§ Use nvidia-docker

§ Especially on Kubernetes!
§ Use Kubernetes 1.7+

§ http://pipeline.ai for GitHub + DockerHub Links

Page 33: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

GPU HALF-PRECISION SUPPORT

§ FP32 is “Full Precision”, FP16 is “Half Precision”
§ Supported by Pascal P100 (2016) and Volta V100 (2017)
§ Half-Precision is OK for Approximate Deep Learning Use Cases
§ Fit Two (2) FP16’s into FP32 GPU Cores for 2x Throughput!

You Can Set TF_FP16_MATMUL_USE_FP32_COMPUTE=0
on GPUs with Compute Capability (CC) 5.3+ (sketch below)
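A minimal sketch of a mixed-precision matmul in TF 1.x: cast the inputs down to FP16, then accumulate the result back in FP32 (`a` and `b` are assumed FP32 tensors defined elsewhere):

import tensorflow as tf

# Hedged sketch: compute the matmul in half precision, keep results in FP32.
a16 = tf.cast(a, tf.float16)
b16 = tf.cast(b, tf.float16)
c = tf.cast(tf.matmul(a16, b16), tf.float32)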

Page 34: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

VOLTA V100 (2017) VS. PASCAL P100 (2016)

§ 84 Streaming Multiprocessors (SMs)
§ 5,376 GPU Cores
§ 672 Tensor Cores (ie. Google TPU)
§ Mixed FP16/FP32 Precision
§ More Shared Memory
§ New L0 Instruction Cache
§ Faster L1 Data Cache

§ V100 vs. P100 Performance
§ 12x TFLOPS @ Peak Training
§ 6x Inference Throughput

Page 35: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

V100 AND CUDA 9

§ Independent Thread Scheduling - Finally!!
§ Similar to CPU fine-grained thread synchronization semantics
§ Allows GPU to yield execution of any thread

§ Still Optimized for SIMT (Single Instruction, Multiple Threads)
§ SIMT units automatically scheduled together

§ Explicit Synchronization

(Diagram: P100 vs. V100 thread scheduling)

Page 36: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

GPU CUDA PROGRAMMING

§ Barbaric, But Fun

§ Must Know Hardware Very Well

§ Hardware Changes are Painful

§ Many Great Debuggers Exist

Page 37: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

CUDA STREAMS

§ Asynchronous I/O Transfer

§ Overlap Compute and I/O

§ Keeps GPUs Saturated

§ Fundamental to Queue Framework in TensorFlow

Page 38: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S SEE WHAT THIS THING CAN DO!

§ Navigate to the following notebook:

01a_Explore_GPU
01b_Explore_Numba

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 39: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 1: TensorFlow Model Training

§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler

Page 40: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

TRAINING TERMINOLOGY

§ Tensors: N-Dimensional Arrays
§ ie. Scalar, Vector, Matrix

§ Operations: MatMul, Add, SummaryLog, …
§ Graph: Graph of Operations (DAG)
§ Session: Contains Graph(s)
§ Feeds: Feed Inputs into Placeholder
§ Fetches: Fetch Output from Operation
§ Variables: What We Learn Through Training
§ aka “weights”, “parameters”
§ Devices: Hardware device on which we train

(Diagram: TensorFlow trains Variables, performs Operations, and flows Tensors; the user feeds Inputs and fetches Outputs)

with tf.device("/gpu:0"):  # pin ops to a specific device ("/gpu:1", etc.); sketch below
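A minimal sketch tying the terms together: feed a Placeholder, run an Operation, fetch its output (the one-op graph is hypothetical):

import tensorflow as tf

x = tf.placeholder(tf.float32, name="x")    # Feed target
y = tf.multiply(x, 3.0)                     # an Operation in the Graph

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: 2.0}))  # Fetch y -> 6.0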

Page 41: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

TENSORFLOW MODEL

§ MetaGraph
§ Combines GraphDef and Metadata

§ GraphDef
§ Architecture of your model (nodes, edges)

§ Metadata
§ Asset: Accompanying assets to your model
§ SignatureDef: Maps external : internal tensors

§ Variables
§ Stored separately during training (checkpoint)
§ Allows training to continue from any checkpoint
§ Variables are “frozen” into Constants when deployed for inference

(Diagram: a GraphDef with nodes x, W, b feeding mul and add; MetaGraph Metadata holds Assets, SignatureDef, Tags, and Version; Variables: “W”: 0.328, “b”: -1.407)

Page 42: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

TENSORFLOW SESSION

(Diagram: a Session holds graph: GraphDef plus Variables (“W”: 0.328, “b”: -1.407); Variables are periodically checkpointed, while the GraphDef is static)

Page 43: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

EXTEND EXISTING DATA PIPELINES

§ Data Processing
§ HDFS/Hadoop
§ Spark

§ Containers
§ Docker

§ Schedulers
§ Kubernetes
§ Mesos

<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-hadoop</artifactId>
</dependency>

https://github.com/tensorflow/ecosystem

Page 44: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

DON’T USE FEED_DICT!!

§ Not Optimized for Production Pipelines
§ Single-Threaded, Synchronous, SLOW!
§ feed_dict Requires Python <-> C++ Serialization
§ Retrieves Next Batch After Current Batch is Done
§ CPUs/GPUs Not Fully Utilized!
§ Use Queue or Dataset API (sketch below)

sess.run(train_step, feed_dict={…})
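As a hedged sketch of the Dataset alternative (TF 1.x tf.data, current at the time of this talk; the file name and `parse_fn` decoder are hypothetical):

import tensorflow as tf

# Build an input pipeline that reads, parses, and prefetches batches
# in the background instead of pushing each batch through feed_dict.
dataset = tf.data.TFRecordDataset(["train.tfrecord"])   # hypothetical file
dataset = dataset.map(parse_fn, num_parallel_calls=4)   # parse_fn: your decoder
dataset = dataset.shuffle(buffer_size=10000).batch(64).prefetch(2)

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()  # a graph tensor: no per-step Python<->C++ copy

# The training op consumes next_batch directly inside the graph:
# sess.run(train_op)  # no feed_dict needed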

Page 45: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

QUEUES

§ More than traditional Queue
§ Uses CUDA Streams
§ Perform I/O, pre-processing, cropping, shuffling, …
§ Pull from HDFS, S3, Google Storage, Kafka, ...
§ Combine many small files into large TFRecord files
§ Use CPUs to free GPUs for compute
§ Helps saturate CPUs and GPUs

Page 46: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

QUEUE CAPACITY PLANNING

§ batch_size
§ # examples / batch (ie. 64 jpg)
§ Limited by GPU RAM

§ num_processing_threads
§ CPU threads pull and pre-process batches of data
§ Limited by CPU Cores

§ queue_capacity
§ Limited by CPU RAM (ie. 5 * batch_size)

See the sketch below.
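A minimal sketch of these knobs with the TF 1.x queue-based input pipeline; `example` stands in for one decoded record from a reader, which is assumed:

import tensorflow as tf

batch_size = 64                        # limited by GPU RAM
examples = tf.train.shuffle_batch(
    [example],                         # `example`: one decoded record (assumed)
    batch_size=batch_size,
    num_threads=4,                     # CPU threads pulling + pre-processing
    capacity=5 * batch_size,           # queue capacity, limited by CPU RAM
    min_after_dequeue=2 * batch_size)  # buffer kept around for good shuffling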

Page 47: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

DETECT UNDERUTILIZED CPUS, GPUS

§ Instrument training code to generate “timelines”

§ Analyze with Google Web Tracing Framework (WTF)

§ Monitor CPU with `top`, GPU with `nvidia-smi`

http://google.github.io/tracing-framework/

from tensorflow.python.client import timeline

trace = timeline.Timeline(step_stats=run_metadata.step_stats)

with open('timeline.json', 'w') as trace_file:
    trace_file.write(trace.generate_chrome_trace_format(show_memory=True))
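For context, a hedged sketch of how the `run_metadata` above is typically produced (standard TF 1.x tracing options; `train_op` is assumed from your training loop):

import tensorflow as tf

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one traced step; step stats land in run_metadata.
sess.run(train_op, options=run_options, run_metadata=run_metadata)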

Page 48: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S FEED DATA WITH A QUEUE

§ Navigate to the following notebook:

02_Feed_Queue_HDFS

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 49: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PULSE CHECK

Page 50: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BREAK

§ Please star this GitHub Repo!

§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml

Need Help? Use the Chat!

Page 51: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S TRAIN A MODEL (CPU)

§ Navigate to the following notebook:

03_Train_Model_CPU

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 52: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S TRAIN A MODEL (GPU)

§ Navigate to the following notebook:

03a_Train_Model_GPU

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 53: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

TENSORFLOW DEBUGGER

§ Step through Operations
§ Inspect Inputs and Outputs
§ Wrap Session in Debug Session

from tensorflow.python import debug as tf_debug

sess = tf.Session(config=config)
sess = tf_debug.LocalCLIDebugWrapperSession(sess)

Page 54: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S DEBUG A MODEL

§ Navigate to the following notebook:

04_Debug_Model

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 55: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 1: TensorFlow Model Training

§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler

Page 56: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SINGLE NODE, MULTI-GPU TRAINING

§ cpu:0
§ By default, all CPUs
§ Requires extra config to target a CPU

§ gpu:0..n
§ Each GPU has a unique id
§ TF usually prefers a single GPU

§ xla_cpu:0, xla_gpu:0..n
§ “JIT Compiler Device”
§ Hints TensorFlow to attempt JIT Compile

with tf.device("/cpu:0"):
with tf.device("/gpu:0"):
with tf.device("/gpu:1"):
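A minimal two-GPU data-parallel sketch; the `input_pipeline()` and `tower_loss()` helpers are hypothetical:

import tensorflow as tf

with tf.device("/cpu:0"):
    images, labels = input_pipeline()      # hypothetical input fn (CPU-side I/O)

tower_losses = []
for i in range(2):
    with tf.device("/gpu:%d" % i):         # one model replica ("tower") per GPU
        tower_losses.append(tower_loss(images, labels))  # hypothetical loss fn

total_loss = tf.add_n(tower_losses) / 2.0  # average across towers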

Page 57: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

DISTRIBUTED, MULTI-NODE TRAINING

§ TensorFlow Automatically Inserts Send and Receive Ops into Graph
§ Parameter Server Synchronously Aggregates Updates to Variables
§ Nodes with Multiple GPUs will Pre-Aggregate Before Sending to PS

(Diagram: cluster topologies from one worker with a single GPU up to three workers with four GPUs each)

Page 58: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

DATA PARALLEL VS MODEL PARALLEL

§ Data Parallel (“Between-Graph Replication”)
§ Send exact same model to each device
§ Each device operates on its partition of data
§ ie. Spark sends same function to many workers; each worker operates on its partition of data

§ Model Parallel (“In-Graph Replication”)
§ Send different partition of model to each device
§ Each device operates on all data
§ Very Difficult!! Required for Large Models (GPU RAM Limitation)

Page 59: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SYNCHRONOUS VS. ASYNCHRONOUS

§ Synchronous
§ Nodes compute gradients
§ Nodes update Parameter Server (PS)
§ Nodes sync on PS for latest gradients

§ Asynchronous
§ Some nodes delay in computing gradients
§ Nodes don’t update PS
§ Nodes get stale gradients from PS
§ May not converge due to stale reads!

Page 60: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

CHIEF WORKER

§ Worker Task 0 is Usually the Chief
§ Task 0 is guaranteed to exist

§ Performs Maintenance Tasks
§ Writes log summaries
§ Instructs PS to checkpoint vars
§ Performs PS health checks
§ (Re-)Initializes variables at (re-)start of training

Page 61: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

NODE AND PROCESS FAILURES

§ Checkpoint to Persistent Storage (HDFS, S3)
§ Use MonitoredTrainingSession and Hooks (sketch below)
§ Use a Good Cluster Orchestrator (ie. Kubernetes, Mesos)
§ Understand Failure Modes and Recovery States

Stateless, Not Bad: Training Continues
Stateful, Bad: Training Must Stop
Dios Mio! Long Night Ahead…
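A hedged sketch of checkpointed, restartable training with MonitoredTrainingSession; `server`, `task_index`, `train_op`, and the HDFS path are assumptions:

import tensorflow as tf

hooks = [tf.train.StopAtStepHook(last_step=100000)]

with tf.train.MonitoredTrainingSession(
        master=server.target,                   # from tf.train.Server (assumed)
        is_chief=(task_index == 0),             # chief restores/initializes vars
        checkpoint_dir="hdfs://namenode/ckpt",  # hypothetical durable storage
        hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)                      # training op (assumed)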

Page 62: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S TRAIN DISTRIBUTED TENSORFLOW

§ Navigate to the following notebook:

05_Train_Model_Distributed_CPU
or 05a_Train_Model_Distributed_GPU

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 63: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

USE EXPERIMENT AND ESTIMATOR API

§ Unified API: Local + Distributed
§ Clean Hooks for Data Retrieval
§ Config: TF_CONFIG + RunConfig
§ Hyper-Parameter Tuning with HParams
§ run() or tune() using learn_runner API (sketch below)
§ Works Well with Google Cloud ML (Surprised?!)

https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/customestimator/trainer/{model.py, task.py}
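A hedged sketch of that entry point (TF 1.x contrib APIs of this era); `build_estimator` and the two input functions are hypothetical helpers:

import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner

def experiment_fn(run_config, hparams):
    # build_estimator, train_input_fn, eval_input_fn: hypothetical helpers
    return tf.contrib.learn.Experiment(
        estimator=build_estimator(run_config),
        train_input_fn=train_input_fn,
        eval_input_fn=eval_input_fn)

# RunConfig picks up cluster membership from the TF_CONFIG env var.
learn_runner.run(experiment_fn, run_config=tf.contrib.learn.RunConfig())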

Page 64: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BATCH NORMALIZATION

§ Each Mini-Batch May Have Wildly Different Distributions
§ Normalize per Batch (and Layer)
§ Faster Training; Weights are Learned Quicker
§ Final Model is More Accurate
§ Final mean + variance Are Folded Into Our Graph Later

-- (Almost) Always Use Batch Normalization! --

z = tf.matmul(a_prev, W)
a = tf.nn.relu(z)

a_mean, a_var = tf.nn.moments(a, [0])

scale = tf.Variable(tf.ones([depth]))   # depth = number of channels
beta = tf.Variable(tf.zeros([depth]))

bn = tf.nn.batch_normalization(a, a_mean, a_var, beta, scale, 0.001)

Page 65: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

OPTIMIZE GRAPH EXECUTION ORDER

§ https://github.com/yaroslavvb/stuff

Linearize to minimize graph memory usage

Page 66: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SEPARATE TRAINING + VALIDATION

§ Separate Training and Validation Clusters

§ Validate Upon Checkpoint

§ Avoid Resource Contention

§ Let Training Continue in Parallel with Validation

(Diagram: Training Cluster and Validation Cluster sharing a Parameter Server Cluster)

Page 67: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PULSE CHECK

Page 68: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BREAK

§ Please star this GitHub Repo!

§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml

Need Help? Use the Chat!

Page 69: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 1: TensorFlow Model Training

§ GPUs and TensorFlow
§ Train, Inspect, and Debug TensorFlow Models
§ TensorFlow Distributed Model Training on a Cluster
§ Optimize Training with JIT XLA Compiler

Page 70: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

XLA FRAMEWORK

§ Accelerated Linear Algebra (XLA)

§ Goals:
§ Reduce reliance on custom operators
§ Improve execution speed
§ Improve memory usage
§ Reduce mobile footprint
§ Improve portability

§ Helps TF Stay Flexible and Performant

Page 71: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

XLA HIGH LEVEL OPTIMIZER (HLO)

§ Compiler Intermediate Representation (IR)
§ Independent of source and target language
§ Define Graphs using HLO Language
§ XLA Step 1 Emits Target-Independent HLO
§ XLA Step 2 Emits Target-Dependent LLVM
§ LLVM Emits Native Code Specific to Target
§ Supports x86-64, ARM64 (CPU), and NVPTX (GPU)

Page 72: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

JIT COMPILER

§ Just-In-Time Compiler
§ Built on XLA Framework

§ Goals:
§ Reduce memory movement – especially useful on GPUs
§ Reduce overhead of multiple function calls

§ Similar to Operator Fusing in Spark 2.0
§ Unroll Loops, Fuse Operators, Fold Constants, …
§ Scope to session, device, or `with jit_scope():` (sketch below)
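Two hedged ways to enable the JIT in TF 1.x, session-wide via ConfigProto or scoped via the contrib jit_scope; `x`, `w`, and `b` are assumed defined elsewhere:

import tensorflow as tf
from tensorflow.contrib.compiler import jit

# Session-wide JIT: compile eligible op clusters across the whole graph.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)
sess = tf.Session(config=config)

# Scoped JIT: only ops created inside the scope are clustered for XLA.
with jit.experimental_jit_scope():
    y = tf.matmul(x, w) + b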

Page 73: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

VISUALIZING JIT COMPILER IN ACTION

(Screenshots: timeline traces before and after JIT compilation)

Google Web Tracing Framework:
http://google.github.io/tracing-framework/

from tensorflow.python.client import timeline

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
    trace_file.write(trace.generate_chrome_trace_format(show_memory=True))

Page 74: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

VISUALIZING FUSING OPERATORS

pip install graphviz

dot -Tpng \
  /tmp/hlo_graph_1.w5LcGs.dot \
  -o hlo_graph_1.png

GraphViz:
http://www.graphviz.org

hlo_*.dot files are generated by XLA

Page 75: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S TRAIN WITH XLA CPU

§ Navigate to the following notebook:

06_Train_Model_XLA_CPU

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 76: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S TRAIN WITH XLA GPU

§ Navigate to the following notebook:

06a_Train_Model_XLA_GPU

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 77: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

Part 1: TensorFlow Model Training

Part 2: TensorFlow Model Serving

Page 78: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 2: TensorFlow Model Serving

§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime

Page 79: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AOT COMPILER

§ Standalone, Ahead-Of-Time (AOT) Compiler
§ Built on XLA framework

§ tfcompile
§ Creates executable with minimal TensorFlow Runtime needed
§ Includes only dependencies needed by subgraph computation
§ Creates functions with feeds (inputs) and fetches (outputs)
§ Packaged as cc_library header and object files to link into your app
§ Commonly used for mobile device inference graphs

§ Currently, only CPU x86-64 and ARM are supported - no GPU

Page 80: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

GRAPH TRANSFORM TOOL (GTT)

§ Post-Training Optimization to Prepare for Inference
§ Remove Training-only Ops (checkpoint, drop out, logs)
§ Remove Unreachable Nodes between Given feed -> fetch
§ Fuse Adjacent Operators to Improve Memory Bandwidth
§ Fold Final Batch Norm mean and variance into Variables
§ Round Weights/Variables to improve compression (ie. 70%)
§ Quantize (FP32 -> INT8) to Speed Up Math Operations

Page 81: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BEFORE OPTIMIZATIONS

Page 82: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

GRAPH TRANSFORM TOOL

# in_graph: original graph; out_graph: transformed graph
# inputs: feed (input); outputs: fetch (output)
transform_graph \
  --in_graph=tensorflow_inception_graph.pb \
  --out_graph=optimized_inception_graph.pb \
  --inputs='Mul' \
  --outputs='softmax' \
  --transforms='
    strip_unused_nodes
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    quantize_weights
    quantize_nodes'

Page 83: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AFTER STRIPPING UNUSED NODES

§ Optimizations
§ strip_unused_nodes

§ Results
§ Graph much simpler
§ File size much smaller

Page 84: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AFTER REMOVING UNUSED NODES

§ Optimizations
§ strip_unused_nodes
§ remove_nodes

§ Results
§ Pesky nodes removed
§ File size a bit smaller

Page 85: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AFTER FOLDING CONSTANTS

§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants

§ Results
§ Placeholders (feeds) -> Variables*

(*Why Variables and not Constants?)

Page 86: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AFTER FOLDING BATCH NORMS

§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms

§ Results
§ Graph remains the same
§ File size approximately the same

Page 87: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AFTER QUANTIZING WEIGHTS

§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights

§ Results
§ Graph is the same; file size is smaller; compute is faster

Page 88: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

WEIGHT QUANTIZATION

§ FP16 and INT8 Are Smaller and Computationally Simpler
§ Weights/Variables are Constants
§ Easy to Linearly Quantize (sketch below)
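A minimal numpy sketch of symmetric linear INT8 quantization of a weight tensor, as an illustration of the idea rather than the GTT implementation:

import numpy as np

def quantize_int8(w):
    # Map FP32 weights into [-127, 127] with a single linear scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)
w_approx = q.astype(np.float32) * scale        # dequantized approximation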

Page 89: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S OPTIMIZE FOR INFERENCE

§ Navigate to the following notebook:

07_Optimize_Model

**Why just the CPU version? Why not GPU?

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 90: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BUT WAIT, THERE’S MORE!

Page 91: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

ACTIVATION QUANTIZATION

§ Activations Not Known Ahead of Time
§ Depends on input, not easy to quantize
§ Requires Additional Calibration Step
§ Use a “representative” dataset

§ Per Neural Network Layer…
§ Collect histogram of activation values
§ Generate many quantized distributions with different saturation thresholds
§ Choose threshold to minimize (sketch below):
KL_divergence(ref_distribution, quant_distribution)

§ Not Much Time or Data is Required (Minutes on Commodity Hardware)
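A hedged numpy sketch of the calibration idea (in the spirit of TensorRT-style entropy calibration; the histogram, bin counts, and threshold grid are illustrative):

import numpy as np

def kl_divergence(p, q):
    # KL(p || q) over histogram bins, guarding against zero buckets.
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))

def quantized_dist(hist, threshold_bin, levels=256):
    # Saturate the tail into the last kept bin, then re-bucket to `levels`.
    clipped = hist[:threshold_bin].astype(np.float64).copy()
    clipped[-1] += hist[threshold_bin:].sum()
    groups = np.array_split(clipped, levels)
    return np.concatenate(
        [np.full(len(g), g.sum() / max(len(g), 1)) for g in groups])

activations = np.random.gamma(2.0, 1.0, 100000)   # stand-in activation values
hist, _ = np.histogram(activations, bins=2048)

best_threshold = min(
    range(256, 2049, 64),                          # candidate saturation bins
    key=lambda t: kl_divergence(hist[:t].astype(np.float64) + 1e-9,
                                quantized_dist(hist, t) + 1e-9))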

Page 92: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AFTER ACTIVATION QUANTIZATION

§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes (activations)

§ Results
§ Larger graph, needs calibration!

Requires additional freeze_requantization_ranges

Page 93: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S OPTIMIZE FOR INFERENCE

§ Navigate to the following notebook:

08_Optimize_Model_Activations

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 94: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

FREEZING MODEL FOR DEPLOYMENT

§ Optimizations
§ strip_unused_nodes
§ remove_nodes
§ fold_constants
§ fold_batch_norms
§ quantize_weights
§ quantize_nodes
§ freeze_graph

§ Results
§ Variables -> Constants

Finally! We’re Ready to Deploy!!

Page 95: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 2: TensorFlow Model Serving

§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime

Page 96: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

MODEL SERVING TERMINOLOGY

§ Inference
§ Only Forward Propagation through Network
§ Predict, Classify, Regress, …

§ Bundle
§ GraphDef, Variables, Metadata, …

§ Assets
§ ie. Map of ClassificationID -> String
§ {9283: “penguin”, 9284: “bridge”}

§ Version
§ Every Model Has a Version Number (Integer)

§ Version Policy
§ ie. Serve Only Latest (Highest), Serve Both Latest and Previous, … (config sketch below)
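A hedged sketch of a TensorFlow Serving model config expressing a version policy; the model name and path are hypothetical, and the exact field shapes vary across TF Serving versions:

model_config_list {
  config {
    name: "mnist"                      # hypothetical model name
    base_path: "/models/mnist"         # numbered version subdirs: 1/, 2/, ...
    model_platform: "tensorflow"
    model_version_policy {
      latest { num_versions: 2 }       # serve both latest and previous
    }
  }
}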

Page 97: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

TENSORFLOW SERVING FEATURES

§ Supports Auto-Scaling
§ Custom Loaders beyond File-based
§ Tune for Low-latency or High-throughput
§ Serve Diff Models/Versions in Same Process
§ Customize Model Types beyond HashMap and TensorFlow
§ Customize Version Policies for A/B and Bandit Tests
§ Support Request Draining for Graceful Model Updates
§ Enable Request Batching for Diff Use Cases and HW
§ Supports Optimized Transport with GRPC and Protocol Buffers

Page 98: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PREDICTION SERVICE

§ Predict (Original, Generic; client sketch below)
§ Input: List of Tensor
§ Output: List of Tensor

§ Classify
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (class_label: String, score: float)

§ Regress
§ Input: List of tf.Example (key, value) pairs
§ Output: List of (label: String, score: float)
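A hedged client-side sketch of calling Predict over gRPC with the client APIs of this era; the host, port, model name, and input `x` are assumptions:

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

channel = implementations.insecure_channel("localhost", 9000)  # assumed host/port
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "mnist"                    # hypothetical model name
request.inputs["inputs"].CopyFrom(
    tf.contrib.util.make_tensor_proto(x))            # x: input data (assumed)

result = stub.Predict(request, 10.0)                 # 10-second timeout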

Page 99: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

PREDICTION INPUTS + OUTPUTS

§ SignatureDef
§ Defines inputs and outputs
§ Maps external (logical) to internal (physical) tensor names
§ Allows internal (physical) tensor names to change

from tensorflow.python.saved_model import utils
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import signature_def_utils

graph = tf.get_default_graph()
x_observed = graph.get_tensor_by_name('x_observed:0')
y_pred = graph.get_tensor_by_name('add:0')

inputs_map = {'inputs': x_observed}
outputs_map = {'outputs': y_pred}

predict_signature = signature_def_utils.predict_signature_def(
    inputs=inputs_map,
    outputs=outputs_map)

Page 100: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

MULTI-HEADED INFERENCE

§ Multiple “Heads” of Model
§ Return class and scores to be fed into another model
§ Inputs Propagated Forward Only Once
§ Optimizes Bandwidth, CPU, Latency, Memory, Coolness

Page 101: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BUILD YOUR OWN MODEL SERVER

§ Adapt GRPC (Google) <-> HTTP (REST of the World)
§ Perform Batch Inference vs. Request/Response
§ Handle Requests Asynchronously
§ Support Mobile, Embedded Inference
§ Customize Request Batching
§ Add Circuit Breakers, Fallbacks
§ Control Latency Requirements
§ Reduce Number of Moving Parts

#include "tensorflow_serving/model_servers/server_core.h"

// Sketch: stand up a ServerCore inside your own serving binary.
int main() {
  tensorflow::serving::ServerCore::Options options;
  // set options (model name, path, etc.)
  std::unique_ptr<tensorflow::serving::ServerCore> core;
  TF_CHECK_OK(tensorflow::serving::ServerCore::Create(
      std::move(options), &core));
  return 0;
}

Compile and Link with libtensorflow.so

Page 102: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

NVIDIA TENSOR-RT RUNTIME

§ Post-Training Model Optimizations
§ Similar to TF Graph Transform Tool

§ GPU-Optimized Prediction Runtime
§ Alternative to TensorFlow Serving

§ PipelineAI Supports TensorRT!

Page 103: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 2: TensorFlow Model Serving

§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime

Page 104: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

SAVED MODEL FORMAT

§ Navigate to the following notebook:

09_Deploy_Optimized_Model

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 105: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 2: TensorFlow Model Serving

§ AOT XLA Compiler and Graph Transform Tool
§ Key Components of TensorFlow Serving
§ Deploy Optimized TensorFlow Model
§ Optimize TensorFlow Serving Runtime

Page 106: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

REQUEST BATCH TUNING

§ max_batch_size
§ Enables throughput/latency tradeoff
§ Bounded by RAM

§ batch_timeout_micros
§ Defines batch time window, latency upper-bound
§ Bounded by RAM

§ num_batch_threads
§ Defines parallelism
§ Bounded by CPU cores

§ max_enqueued_batches
§ Defines queue upper bound, throttling
§ Bounded by RAM

Reaching either threshold will trigger a batch (config sketch below).
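A hedged sketch of these knobs as a batching parameters file, passed to tensorflow_model_server via --enable_batching --batching_parameters_file (the values are illustrative, not recommendations):

max_batch_size { value: 64 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 64 }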

Page 107: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

BATCH SCHEDULER STRATEGIES

§ BasicBatchScheduler
§ Best for homogeneous request types (ie. always classify or always regress)
§ Async callback upon max_batch_size or batch_timeout_micros
§ BatchTask encapsulates unit of work to be batched

§ SharedBatchScheduler
§ Best for heterogeneous request types, multi-step inference, ensembles, …
§ Groups BatchTasks into separate queues to form homogeneous batches
§ Processes batches fairly through interleaving

§ StreamingBatchScheduler
§ Mixed CPU/GPU/IO-bound workloads
§ Provides fine-grained control for complex, multi-phase inference logic

Experiment to Find the Best Batching Strategy for You!!
Co-locate Similar Prediction Workloads

Page 108: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

LET’S DEPLOY OPTIMIZED MODEL

§ Navigate to the following notebook:

10_Optimize_Model_Server

§ https://github.com/PipelineAI/pipeline/tree/master/gpu.ml/notebooks

Page 109: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

AGENDA

Part 0: PipelineAI Research

Part 1: TensorFlow Model Training

Part 2: TensorFlow Model Serving

Page 110: Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with GPUs - Washington DC - Nov 2017

THANK YOU!! QUESTIONS?

§ https://github.com/PipelineAI/pipeline/

§ Please star this GitHub Repo!

§ All slides, code, notebooks, and Docker images here:
https://github.com/PipelineAI/pipeline/tree/master/gpu.ml

Contact [email protected]

@cfregly