End to End Optimization Stack for Deep Learning
Presenter: Tianqi Chen
Paul G. Allen School of Computer Science & Engineering, University of Washington
Collaborators
Tianqi Chen Thierry Moreau Haichen Shen Ziheng Jiang
Carlos Guestrin Luis Ceze Arvind Krishnamurthy
University of Washington AWS AI Team
and many more contributors in the DMLC community
Deep Learning System Research is Exciting but Hard
[Stack diagram: frameworks (e.g. CNTK) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware]
Built a new accelerator? You need the entire software stack on top of it: layout transformation, quantization, operator kernel optimization, benchmarking, …
Meanwhile, graph-level transformations such as data layout optimization and operator fusion multiply the operator variants: you need an optimized hardware kernel for each variant, on each hardware! And then there is serving.
The End to End System Challenge
Frameworks sit at the top and hardware back-ends at the bottom; bridging the growing gap between them calls for an intermediate representation.
Computational Graph IR and Remaining Gap
A computational graph IR between frameworks and backends enables optimizations such as automatic differentiation, memory planning, and operator fusion. Examples: NGraph, XLA, NNVM, DLVM …
The remaining gap: at the operator level there are too many possible choices (precision, layout, fused pattern, device, threading …). We need a low-level IR to express them explicitly.
TVM: Low Level IR
Framework → NNVM graph (auto differentiation, memory planning) → TVM → hardware backends
• Concise and compact description
• Explicit control over codegen
• Ease of deployment
• Support for new hardware backends
Tensor Index Expression Declaration

import tvm

m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')

# Inputs
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')

# Reduction axis over the shared dimension h
k = tvm.reduce_axis((0, h), name='k')

# Shape of C, plus the computation rule: C = dot(A, B.T)
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
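To make the computation rule concrete, here is a minimal sketch (not from the slides) that builds the expression with a default schedule and checks it against NumPy; the concrete sizes and the TVM 0.x calls (tvm.create_schedule, tvm.build) are assumptions based on the API shown above.

import numpy as np
import tvm

# Concrete sizes for a quick check (illustrative choice)
m, n, h = 64, 64, 64
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))

# Default schedule, compiled for CPU via LLVM
s = tvm.create_schedule(C.op)
f = tvm.build(s, [A, B, C], target='llvm')

ctx = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(m, h).astype('float32'), ctx)
b = tvm.nd.array(np.random.rand(n, h).astype('float32'), ctx)
c = tvm.nd.array(np.zeros((m, n), dtype='float32'), ctx)
f(a, b, c)

# The computation rule says C = dot(A, B.T)
np.testing.assert_allclose(c.asnumpy(), np.dot(a.asnumpy(), b.asnumpy().T), rtol=1e-5)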
Challenge: Hardware Diversities
The same IR must target CPUs, GPUs, and accelerators, which differ along several axes:
• Memory subsystem: CPUs have implicitly managed caches (L1D/L1I and L2 per core, shared L3); GPUs mix register files and TX/L1 per SM with a shared L2; accelerators use explicitly managed unified buffers, accumulators, and FIFOs.
• Compute primitives: scalar, vector, tensor.
• Data types: fp32, fp16, int8.
Unified Schedule Optimizations for Hardware
An algorithm described in the IR passes through scheduling optimizations and is then lowered to generated code (LLVM, CUDA, OpenCL …). Scheduling optimizations covered (see the sketch after this list):
(✔) Data layout
(✔) Tiling
(✔) Thread cooperation
(✔) Latency hiding
(✔) Tensorization
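As a concrete illustration (a sketch, not taken from the slides), here is how a tiling schedule might be applied to the matmul declared earlier, using the TVM 0.x schedule API; the split factor 32 is an arbitrary illustrative choice.

import tvm

# Declare the matmul as in the tensor expression slide
n = tvm.var('n')
A = tvm.placeholder((n, n), name='A')
B = tvm.placeholder((n, n), name='B')
k = tvm.reduce_axis((0, n), name='k')
C = tvm.compute((n, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k), name='C')

# The algorithm stays fixed; the schedule decides how it runs
s = tvm.create_schedule(C.op)

# Tiling: split each output loop into outer/inner pairs
io, ii = s[C].split(C.op.axis[0], factor=32)
jo, ji = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(io, jo, ii, ji)

# Inspect the lowered loop nest before code generation
print(tvm.lower(s, [A, B, C], simple_mode=True))

The same declaration can be rescheduled per target, for example binding loops to GPU threads with s[C].bind(...), which is how one IR serves CPUs, GPUs, and accelerators.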
Separation of Compilation and Deployment
The compilation stack (framework frontends → NNVM → TVM) performs the heavy optimizations; the deployed TVM graph module runs on lightweight TVM runtimes of only 300 to 600 KB.
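A minimal sketch of that separation (an assumed workflow, not shown on the slide): the compiler exports a shared library, and the lightweight runtime loads it on the target. The file name deploy.so is a placeholder.

import tvm

# Compile side: heavy optimization happens here
n = tvm.var('n')
A = tvm.placeholder((n,), name='A')
B = tvm.compute((n,), lambda i: A[i] + 1.0, name='B')
s = tvm.create_schedule(B.op)
func = tvm.build(s, [A, B], target='llvm', name='addone')
func.export_library('deploy.so')

# Deploy side: only the ~300-600 KB runtime is needed to load and run it
loaded = tvm.module.load('deploy.so')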
Remote Execution and Profiling
A server running the TVM compiler drives devices running the TVM runtime over TVM RPC.
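In code, the workflow might look like the following sketch, using TVM's RPC API; the address, port, and module name are placeholders.

import tvm
from tvm import rpc

# Connect to a device that is running the TVM RPC server
remote = rpc.connect('192.168.1.99', 9090)

# Upload a cross-compiled module and load it on the device
remote.upload('deploy.so')
func = remote.load_module('deploy.so')

# Profile the kernel where it will actually run
ctx = remote.cpu(0)
timer = func.time_evaluator(func.entry_name, ctx, number=10)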
Performance Portable Against the State of the Art
[Chart: inference time cost (ms) for ResNet18 and MobileNet, MXNet baseline vs. NNVM Compiler]
• NVIDIA K80 (baseline: MXNet with cuDNN, auto-tune enabled): 1.2x speedup on ResNet18 and 1.2x on MobileNet, after one grad-student month of effort.
• Raspberry Pi 3 (baseline: MXNet with OpenBLAS and NNPack): 2.2x speedup on ResNet18 and 11.5x on MobileNet, after two undergrad weeks of effort.
Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Ziheng Jiang (AWS/FDU)
Coming Soon: Targeting New Accelerators
• Tensorization
• Latency hiding
• FPGA example for building a new hardware backend
• Open source soon
NNVM Compiler: Open Compiler for AI Systems
[Stack diagram]
• Framework frontends (supported, via external support, or work in progress): MXNet, Keras, CoreML, ONNX, PyTorch, Caffe2, CNTK, Caffe
• NNVM: graph optimizations
• TVM: TVM primitives
• Hardware backends: LLVM (X86, ARM, Javascript/WASM), OpenCL, Metal, CUDA; AMD GPUs and more backends in progress
Joint work with the AWS AI Team and the DMLC community
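To ground the diagram, here is a sketch of the NNVM compiler workflow with a toy one-operator graph; a real model would come from a frontend such as nnvm.frontend.from_mxnet. API names follow the NNVM/TVM 0.x releases; treat the details as assumptions.

import numpy as np
import nnvm.compiler
import nnvm.symbol as sym
import tvm
from tvm.contrib import graph_runtime

# Toy graph standing in for an imported model
x = sym.Variable('data')
net = sym.relu(x)

# Graph-level optimization plus TVM code generation in one call
shape = {'data': (1, 3, 224, 224)}
graph, lib, params = nnvm.compiler.build(net, target='llvm', shape=shape)

# Execute with the lightweight graph runtime
module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.set_input('data', np.random.uniform(-1, 1, shape['data']).astype('float32'))
module.run()
out = module.get_output(0, tvm.nd.empty(shape['data']))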
Deep Learning System Research is Exciting but Hard
[Recap of the opening stack diagram: frameworks (e.g. CNTK) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware]
Deep Learning System Research is Just Exciting
With NNVM (graph optimizations) and TVM (TVM primitives) bridging frameworks and hardware:
"I can program my new accelerators from Python :)"
"My new optimizations work on all platforms!"
You can be part of it!