End to End Optimization Stack for Deep...

46
End to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul G. Allen School of Computer Science & Engineering University of Washington

Transcript of End to End Optimization Stack for Deep...

Page 1: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

End to End Optimization Stack for Deep Learning

Presenter: Tianqi Chen

Paul G. Allen School of Computer Science & Engineering University of Washington

Page 2: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Collaborators

Tianqi Chen Thierry Moreau Haichen Shen Ziheng Jiang

Carlos Guestrin Luis Ceze Arvind Krishnamurthy

ML, Software Stack Hardware Stack GPU ARM, NNVM pipeline

University of Washington AWS AI Team

and many more contributors in the DMLC community

Page 3: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Frameworks

Operator Libraries cuDNN, NNPack, MKL-DNN

CNTK

Page 4: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

FrameworksCNTK

Page 5: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

FrameworksCNTK

Page 6: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

Built a new accelerator

FrameworksCNTK

Page 7: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

Built a new accelerator

Need entire software stack on top of it!

Layout transformation Quantization Operator kernel optimization Benchmarking ….

FrameworksCNTK

Page 8: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

FrameworksCNTK

Page 9: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

FrameworksCNTK

Page 10: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

Data Layout Optimization

FrameworksCNTK

Page 11: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

Data Layout OptimizationOperator Fusion

FrameworksCNTK

Page 12: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

Data Layout OptimizationOperator Fusion

Need optimized hardware kernel for each variant, on each hardware!

FrameworksCNTK

Page 13: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Operator Libraries cuDNN, NNPack, MKL-DNN

Data Layout OptimizationOperator Fusion

Need optimized hardware kernel for each variant, on each hardware!

FrameworksCNTK

Serving

Page 14: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 15: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 16: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 17: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 18: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 19: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 20: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Page 21: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

The End to End System Challenge

Hardware Back-Ends

Frameworks

Intermediate representation

Page 22: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Computational Graph IR and Remaining Gap

 

 

Auto DifferentiationMemory Plan

Operator Fusion

Computational Graph

Backends

Examples: NGraph, XLA, NNVM, DLVM …

Page 23: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Computational Graph IR and Remaining Gap

 

 

Auto DifferentiationMemory Plan

Operator Fusion

Computational Graph

Backends

Page 24: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Computational Graph IR and Remaining Gap

 

 

Auto DifferentiationMemory Plan

Operator Fusion

Computational Graph

Backends

too many possible choices: precision, layout, fused pattern, device, threading …

Need a low level IR to express them explicitly

Page 25: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

TVM: Low Level IR

TVM

Memory Plan

NNVM GraphAuto Differentiation

Framework

Hardware backends

• Concise and compact description

• Explicit control on codegen

• Ease of deployment

• Support new hardware backends

Page 26: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Tensor Index Expression Declaration

import tvm m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h') A = tvm.placeholder((m, h), name='A') B = tvm.placeholder((n, h), name=‘B') k = tvm.reduce_axis((0, h), name=‘k') C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))

Inputs

Shape of C

Compute C = dot(A, B.T)

Computation Rule

Page 27: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Challenge: Hardware Diversities

IR

Page 28: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Challenge: Hardware Diversities

IR

CPU GPU Accelerators

Page 29: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Challenge: Hardware Diversities

IR

CPU GPU Accelerators

L2

RF RFTX/L1

SM

RF RFTX/L1

SM

L1D L1IL2

L3

L1D L1IL2

Unified Buffer Acc

FIFOMemory

subsystem

implicitly managed mixed explicitly managed

Page 30: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Challenge: Hardware Diversities

IR

CPU GPU Accelerators

L2

RF RFTX/L1

SM

RF RFTX/L1

SM

L1D L1IL2

L3

L1D L1IL2

Unified Buffer Acc

FIFOMemory

subsystem

implicitly managed mixed explicitly managed

Compute primitives

scalar vector tensor

Page 31: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Challenge: Hardware Diversities

IR

CPU GPU Accelerators

L2

RF RFTX/L1

SM

RF RFTX/L1

SM

L1D L1IL2

L3

L1D L1IL2

Unified Buffer Acc

FIFOMemory

subsystem

implicitly managed mixed explicitly managed

Compute primitives

scalar vector tensor

fp32 fp16 int8Data type

Page 32: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Generated code

(LLVM, CUDA, OpenCL…)

Lowering

Algorithm described in IR

Page 33: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Generated code

(LLVM, CUDA, OpenCL…)

Lowering

Algorithm described in IR

Scheduling Optimization

Page 34: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Scheduling Optimizations(✔) Data layout

Generated code

(LLVM, CUDA, OpenCL…)

Lowering

Algorithm described in IR

Scheduling Optimization

Page 35: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Scheduling Optimizations(✔) Data layout(✔) Tiling

Generated code

(LLVM, CUDA, OpenCL…)

Lowering

Algorithm described in IR

Scheduling Optimization

Page 36: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Scheduling Optimizations(✔) Data layout(✔) Tiling(✔) Thread cooperation

Generated code

(LLVM, CUDA, OpenCL…)

Lowering

Algorithm described in IR

Scheduling Optimization

Page 37: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Scheduling Optimizations(✔) Data layout(✔) Tiling(✔) Thread cooperation(✔) Latency hiding Generated

code (LLVM, CUDA,

OpenCL…)

Lowering

Algorithm described in IR

Scheduling Optimization

Page 38: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Unified Schedule Optimizations for Hardwares

Scheduling Optimizations(✔) Data layout(✔) Tiling(✔) Thread cooperation(✔) Latency hiding(✔) Tensorization

Generated code

(LLVM, CUDA, OpenCL…)

Lowering

Algorithm described in IR

Scheduling Optimization

Page 39: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Separation of Compilation and Deployment

TVM RuntimesCompilation Stack

Lightweight, 300 to 600 KBHeavy optimizations

Deploy

TVM

NNVM

TVM Graph Module

Framework Frontends

Page 40: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Remote Execution and Profiling

Server with TVM Compiler

Devices with TVM Runtime

TVM RPC

Page 41: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Performance Portable against state of art

0

2

4

6

8

ResNet18 MobileNet

MXNetNNVM Compiler

Time cost(ms)

K80, Baseline MXNet with cuDNN auto tune enabled

One grad student month

1.2x

1.2x

0

750

1500

2250

3000

ResNet18 MobileNet

MXNetNNVM Compiler

Time cost(ms)

Raspberry Pi 3 Baseline: MXNet with OpenBLAS and NNPack

Two undergrad weeks

2.2x

11.5x

Credit: Leyuan Wang(AWS/UCDavis), Yuwei Hu(TuSimple), Zheng Jiang(AWS/FDU)

Page 42: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Coming Soon: Target New Accelerators

Tensorization

Latency Hiding

FPGA Example for building new hardware backend

Open-source soon

Page 43: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

NNVM Compiler: Open Compiler for AI Systems

TVM

NNVM

LLVMOpenCL Metal CUDAMore hardware backends

X86 ARM Javascript/WASM

MXNetKeras

CoreML

PyTorch Caffe2 CNTK

ONNX

Caffe

Graph Optimizations

TVM Primitives

External SupportSupported Work in progress

AMDGPUs

Joint Work with AWS AI Team and DMLC community

Page 44: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Exciting but Hard

Computational graph

Hardware

Frameworks

Operator Libraries cuDNN, NNPack, MKL-DNN

CNTK

Page 45: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Just Exciting

Hardware

FrameworksCNTK

TVM

NNVM

Graph Optimizations

TVM PrimitivesI can program my new accelerators from python :)

My new optimizations works on all platforms !

Page 46: End to End Optimization Stack for Deep Learninglearningsys.org/nips17/assets/slides/TVM-MLSys-NIPS17.pdfEnd to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul

Deep Learning System Research is Just Exciting

Hardware

FrameworksCNTK

TVM

NNVM

Graph Optimizations

TVM Primitives

You can be part of it!

I can program my new accelerators from python :)

My new optimizations works on all platforms !