End to End Optimization Stack for Deep Learning
Presenter: Tianqi Chen
Paul G. Allen School of Computer Science & Engineering, University of Washington
Collaborators
Tianqi Chen Thierry Moreau Haichen Shen Ziheng Jiang
Carlos Guestrin Luis Ceze Arvind Krishnamurthy
University of Washington AWS AI Team
and many more contributors in the DMLC community
Deep Learning System Research is Exciting but Hard
[Stack diagram: frameworks (e.g. CNTK) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware]
Built a new accelerator? You need the entire software stack on top of it: layout transformation, quantization, operator kernel optimization, benchmarking, …
Meanwhile, graph-level transformations such as data layout optimization and operator fusion multiply the operator variants: you need an optimized hardware kernel for each variant, on each hardware! And then there is serving.
The End to End System Challenge
Frameworks sit at the top and hardware back-ends at the bottom; bridging the growing gap between them calls for an intermediate representation.
Computational Graph IR and Remaining Gap
A computational graph IR between frameworks and backends enables optimizations such as automatic differentiation, memory planning, and operator fusion. Examples: NGraph, XLA, NNVM, DLVM …
The remaining gap: at the operator level there are too many possible choices (precision, layout, fused pattern, device, threading …). We need a low-level IR to express them explicitly.
TVM: Low Level IR
Framework → NNVM graph (auto differentiation, memory planning) → TVM → hardware backends
• Concise and compact description
• Explicit control over codegen
• Ease of deployment
• Support for new hardware backends
Tensor Index Expression Declaration

import tvm

m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')

# Inputs
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')

# Reduction axis over the shared dimension h
k = tvm.reduce_axis((0, h), name='k')

# Shape of C, plus the computation rule: C = dot(A, B.T)
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
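To make the computation rule concrete, here is a minimal sketch (not from the slides) that builds the expression with a default schedule and checks it against NumPy; the concrete sizes and the TVM 0.x calls (tvm.create_schedule, tvm.build) are assumptions based on the API shown above.

import numpy as np
import tvm

# Concrete sizes for a quick check (illustrative choice)
m, n, h = 64, 64, 64
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))

# Default schedule, compiled for CPU via LLVM
s = tvm.create_schedule(C.op)
f = tvm.build(s, [A, B, C], target='llvm')

ctx = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(m, h).astype('float32'), ctx)
b = tvm.nd.array(np.random.rand(n, h).astype('float32'), ctx)
c = tvm.nd.array(np.zeros((m, n), dtype='float32'), ctx)
f(a, b, c)

# The computation rule says C = dot(A, B.T)
np.testing.assert_allclose(c.asnumpy(), np.dot(a.asnumpy(), b.asnumpy().T), rtol=1e-5)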
Challenge: Hardware Diversities
The same IR must target CPUs, GPUs, and accelerators, which differ along several axes:
• Memory subsystem: CPUs have implicitly managed caches (L1D/L1I and L2 per core, shared L3); GPUs mix register files and TX/L1 per SM with a shared L2; accelerators use explicitly managed unified buffers, accumulators, and FIFOs.
• Compute primitives: scalar, vector, tensor.
• Data types: fp32, fp16, int8.
Unified Schedule Optimizations for Hardware
An algorithm described in the IR passes through scheduling optimizations and is then lowered to generated code (LLVM, CUDA, OpenCL …). Scheduling optimizations covered (see the sketch after this list):
(✔) Data layout
(✔) Tiling
(✔) Thread cooperation
(✔) Latency hiding
(✔) Tensorization
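As a concrete illustration (a sketch, not taken from the slides), here is how a tiling schedule might be applied to the matmul declared earlier, using the TVM 0.x schedule API; the split factor 32 is an arbitrary illustrative choice.

import tvm

# Declare the matmul as in the tensor expression slide
n = tvm.var('n')
A = tvm.placeholder((n, n), name='A')
B = tvm.placeholder((n, n), name='B')
k = tvm.reduce_axis((0, n), name='k')
C = tvm.compute((n, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k), name='C')

# The algorithm stays fixed; the schedule decides how it runs
s = tvm.create_schedule(C.op)

# Tiling: split each output loop into outer/inner pairs
io, ii = s[C].split(C.op.axis[0], factor=32)
jo, ji = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(io, jo, ii, ji)

# Inspect the lowered loop nest before code generation
print(tvm.lower(s, [A, B, C], simple_mode=True))

The same declaration can be rescheduled per target, for example binding loops to GPU threads with s[C].bind(...), which is how one IR serves CPUs, GPUs, and accelerators.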
Separation of Compilation and Deployment
The compilation stack (framework frontends → NNVM → TVM) performs the heavy optimizations; the deployed TVM graph module runs on lightweight TVM runtimes of only 300 to 600 KB.
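A minimal sketch of that separation (an assumed workflow, not shown on the slide): the compiler exports a shared library, and the lightweight runtime loads it on the target. The file name deploy.so is a placeholder.

import tvm

# Compile side: heavy optimization happens here
n = tvm.var('n')
A = tvm.placeholder((n,), name='A')
B = tvm.compute((n,), lambda i: A[i] + 1.0, name='B')
s = tvm.create_schedule(B.op)
func = tvm.build(s, [A, B], target='llvm', name='addone')
func.export_library('deploy.so')

# Deploy side: only the ~300-600 KB runtime is needed to load and run it
loaded = tvm.module.load('deploy.so')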
Remote Execution and Profiling
A server running the TVM compiler drives devices running the TVM runtime over TVM RPC.
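In code, the workflow might look like the following sketch, using TVM's RPC API; the address, port, and module name are placeholders.

import tvm
from tvm import rpc

# Connect to a device that is running the TVM RPC server
remote = rpc.connect('192.168.1.99', 9090)

# Upload a cross-compiled module and load it on the device
remote.upload('deploy.so')
func = remote.load_module('deploy.so')

# Profile the kernel where it will actually run
ctx = remote.cpu(0)
timer = func.time_evaluator(func.entry_name, ctx, number=10)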
Performance Portable Against the State of the Art
[Chart: inference time cost (ms) for ResNet18 and MobileNet, MXNet baseline vs. NNVM Compiler]
• NVIDIA K80 (baseline: MXNet with cuDNN, auto-tune enabled): 1.2x speedup on ResNet18 and 1.2x on MobileNet, after one grad-student month of effort.
• Raspberry Pi 3 (baseline: MXNet with OpenBLAS and NNPack): 2.2x speedup on ResNet18 and 11.5x on MobileNet, after two undergrad weeks of effort.
Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Ziheng Jiang (AWS/FDU)
Coming Soon: Targeting New Accelerators
• Tensorization
• Latency hiding
• FPGA example for building a new hardware backend
• Open source soon
NNVM Compiler: Open Compiler for AI Systems
[Stack diagram]
• Framework frontends (supported, via external support, or work in progress): MXNet, Keras, CoreML, ONNX, PyTorch, Caffe2, CNTK, Caffe
• NNVM: graph optimizations
• TVM: TVM primitives
• Hardware backends: LLVM (X86, ARM, Javascript/WASM), OpenCL, Metal, CUDA; AMD GPUs and more backends in progress
Joint work with the AWS AI Team and the DMLC community
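To ground the diagram, here is a sketch of the NNVM compiler workflow with a toy one-operator graph; a real model would come from a frontend such as nnvm.frontend.from_mxnet. API names follow the NNVM/TVM 0.x releases; treat the details as assumptions.

import numpy as np
import nnvm.compiler
import nnvm.symbol as sym
import tvm
from tvm.contrib import graph_runtime

# Toy graph standing in for an imported model
x = sym.Variable('data')
net = sym.relu(x)

# Graph-level optimization plus TVM code generation in one call
shape = {'data': (1, 3, 224, 224)}
graph, lib, params = nnvm.compiler.build(net, target='llvm', shape=shape)

# Execute with the lightweight graph runtime
module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.set_input('data', np.random.uniform(-1, 1, shape['data']).astype('float32'))
module.run()
out = module.get_output(0, tvm.nd.empty(shape['data']))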
Deep Learning System Research is Exciting but Hard
[Recap of the opening stack diagram: frameworks (e.g. CNTK) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware]
Deep Learning System Research is Just Exciting
With NNVM (graph optimizations) and TVM (TVM primitives) bridging frameworks and hardware:
"I can program my new accelerators from Python :)"
"My new optimizations work on all platforms!"
You can be part of it!