December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for...
Transcript of December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for...
![Page 1: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/1.jpg)
1st TVM and Deep Learning Compilation Conference
December 12, 2018
![Page 2: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/2.jpg)
Luis Ceze
![Page 3: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/3.jpg)
Welcome to the 1st TVM and Deep Learning Compilation Conference!
![Page 4: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/4.jpg)
180+ ppl!Welcome to the 1st TVM and Deep Learning Compilation Conference!
![Page 5: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/5.jpg)
Machine learning is amazing…
![Page 6: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/6.jpg)
Machine learning is amazing…
super human accuracy, self driving cars, automated scientific
discoveries…
![Page 7: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/7.jpg)
Machine learning is amazing…
super human accuracy, self driving cars, automated scientific
discoveries…
wow!
![Page 8: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/8.jpg)
Problem to solve Write code Run on fast machineSoftware era:
![Page 9: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/9.jpg)
Problem to solve Write code Run on fast machineSoftware era:
![Page 10: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/10.jpg)
Problem to solve Write code Run on fast machineSoftware era:
Train on fastest machineProblem to solve Data + model
templatesInference on fast &
cheap machineMachine learning era:
![Page 11: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/11.jpg)
Problem to solve Write code Run on fast machineSoftware era:
Train on fastest machineProblem to solve Data + model
templatesInference on fast &
cheap machineMachine learning era:
by Eugenio Culurciello
Model size and compute cost growing fast
![Page 12: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/12.jpg)
Problem to solve Write code Run on fast machineSoftware era:
Train on fastest machineProblem to solve Data + model
templatesInference on fast &
cheap machineMachine learning era:
by Eugenio Culurciello
Model size and compute cost growing fast
by Open AI
Training costs growing exponentially
![Page 13: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/13.jpg)
![Page 14: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/14.jpg)
Popularity and computational cost of
ML. Oops.
![Page 15: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/15.jpg)
Popularity and computational cost of
ML. Oops.Fundamental trade-off between specialization and performance/efficiency.
![Page 16: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/16.jpg)
Popularity and computational cost of
ML. Oops.
MoreGeneral/Programmable
Be0erPerform
ance/EnergyEffi
ciency
FixedFunc*onChips
FPGA
GPUs
GeneralPurposeCPUs
Fundamental trade-off between specialization and performance/efficiency.
![Page 17: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/17.jpg)
Popularity and computational cost of
ML. Oops.
MoreGeneral/Programmable
Be0erPerform
ance/EnergyEffi
ciency
FixedFunc*onChips
FPGA
GPUs
GeneralPurposeCPUs
Machine learning algorithms are relatively simple to implement in HW… great!
Fundamental trade-off between specialization and performance/efficiency.
![Page 18: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/18.jpg)
Popularity and computational cost of
ML. Oops.
MoreGeneral/Programmable
Be0erPerform
ance/EnergyEffi
ciency
FixedFunc*onChips
FPGA
GPUs
GeneralPurposeCPUs
Machine learning algorithms are relatively simple to implement in HW… great!
Machine Learning Makes Computer Architecture Cool Again!
Fundamental trade-off between specialization and performance/efficiency.
![Page 19: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/19.jpg)
…
![Page 20: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/20.jpg)
…+~50 startups
![Page 21: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/21.jpg)
![Page 22: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/22.jpg)
…+~50 startups
![Page 23: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/23.jpg)
CNNGAN
RNNMLP
DQNNModels:
…+~50 startups
![Page 24: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/24.jpg)
CNNGAN
RNNMLP
DQNNModels:
Frameworks:
…+~50 startups
![Page 25: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/25.jpg)
Challenge: Efficiently deploying deep learning everywhere
CNNGAN
RNNMLP
DQNNModels:
Frameworks:
…+~50 startups
![Page 26: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/26.jpg)
Gaurav Kapoor, Core Machine Learning
![Page 27: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/27.jpg)
HW+SW optimization is key for efficiency
![Page 28: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/28.jpg)
HW+SW optimization is key for efficiency
Lots of hand-tuning, full automation would be a holy grail
![Page 29: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/29.jpg)
Academic group focused on Systems + Computer Architecture + Machine Learning + Programming Languages
![Page 30: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/30.jpg)
Academic group focused on Systems + Computer Architecture + Machine Learning + Programming Languages
![Page 31: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/31.jpg)
Academic group focused on Systems + Computer Architecture + Machine Learning + Programming Languages
![Page 32: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/32.jpg)
Computer Architecture: Extensible, energy efficient hardware designs for inference and training
Compilers: Extensible support for future models, optimizations and hardware architectures
PL: High-level support for future ML applications
Systems: On-device and cloud-based training, distributed systems for ML
![Page 33: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/33.jpg)
Computer Architecture: Extensible, energy efficient hardware designs for inference and training
Compilers: Extensible support for future models, optimizations and hardware architectures
PL: High-level support for future ML applications
Systems: On-device and cloud-based training, distributed systems for ML
ML for Systems:Automatic Learning-Based Design and
Optimizations
![Page 34: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/34.jpg)
Computer Architecture: Extensible, energy efficient hardware designs for inference and training
Compilers: Extensible support for future models, optimizations and hardware architectures
PL: High-level support for future ML applications
Systems: On-device and cloud-based training, distributed systems for ML
ML for Systems:Automatic Learning-Based Design and
Optimizations
ML for better ML systems!
![Page 35: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/35.jpg)
Computer Architecture: Extensible, energy efficient hardware designs for inference and training
Compilers: Extensible support for future models, optimizations and hardware architectures
PL: High-level support for future ML applications
Systems: On-device and cloud-based training, distributed systems for ML
ML for Systems:Automatic Learning-Based Design and
Optimizations
ML for better ML systems!
![Page 36: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/36.jpg)
Open Source Deployment
Provides infrastructure
Open source when ready
Computer Architecture: Extensible, energy efficient hardware designs for inference and training
Compilers: Extensible support for future models, optimizations and hardware architectures
PL: High-level support for future ML applications
Systems: On-device and cloud-based training, distributed systems for ML
ML for Systems:Automatic Learning-Based Design and
Optimizations
ML for better ML systems!
![Page 37: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/37.jpg)
First major open source compiler collection
Open source compilers have transformed our industry
![Page 38: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/38.jpg)
First major open source compiler collection
LLVM: Higher-Level IR, new optimizations, easier extensibility
Open source compilers have transformed our industry
![Page 39: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/39.jpg)
In the age of domain-specialized systems…
First major open source compiler collection
LLVM: Higher-Level IR, new optimizations, easier extensibility
Open source compilers have transformed our industry
![Page 40: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/40.jpg)
In the age of domain-specialized systems…
First major open source compiler collection
LLVM: Higher-Level IR, new optimizations, easier extensibility
Open source compilers have transformed our industry
Specialized compiler stack for Deep Learning
![Page 41: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/41.jpg)
End the tyranny of closed deep learning systems!
![Page 42: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/42.jpg)
Tianqi Chen
![Page 43: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/43.jpg)
![Page 44: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/44.jpg)
High-Level Differentiable IR
![Page 45: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/45.jpg)
High-Level Differentiable IR
Tensor Expression IR
![Page 46: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/46.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal
![Page 47: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/47.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA ASIC
![Page 48: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/48.jpg)
import tvm from tvm import relay graph, params = frontend.from_keras(keras_resnet50) graph, lib, params = relay.build(graph, target)
Compile
![Page 49: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/49.jpg)
import tvm from tvm import relay graph, params = frontend.from_keras(keras_resnet50) graph, lib, params = relay.build(graph, target)
Compile
![Page 50: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/50.jpg)
import tvm from tvm import relay graph, params = frontend.from_keras(keras_resnet50) graph, lib, params = relay.build(graph, target)
Compile Deployable Module
tabby, tabby cat
module = runtime.create(graph, lib, tvm.gpu(0))module.set_input(**params)module.run(data=data_array)output = tvm.nd.empty(out_shape, ctx=tvm.gpu(0))module.get_output(0, output)
prediction
input
Deploy
![Page 51: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/51.jpg)
On languages and platforms you choose
import tvm from tvm import relay graph, params = frontend.from_keras(keras_resnet50) graph, lib, params = relay.build(graph, target)
Compile Deployable Module
tabby, tabby cat
module = runtime.create(graph, lib, tvm.gpu(0))module.set_input(**params)module.run(data=data_array)output = tvm.nd.empty(out_shape, ctx=tvm.gpu(0))module.get_output(0, output)
prediction
input
Deploy
![Page 52: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/52.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Automated by Machine Learning
![Page 53: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/53.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Optimization
AutoTVM
AutoVTA
Hardware Fleet
Automated by Machine Learning
![Page 54: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/54.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Optimization
AutoTVM
AutoVTA
Hardware Fleet
Automated by Machine Learning
![Page 55: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/55.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Optimization
AutoTVM
AutoVTA
Hardware Fleet
Automated by Machine Learning
![Page 56: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/56.jpg)
Diverse Hardware backends
High-Level Differentiable IR
Tensor Expression IR
Optimization
AutoTVM
![Page 57: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/57.jpg)
Diverse Hardware backends
High-Level Differentiable IR
Tensor Expression IR
Optimization
AutoTVM
LLVM
ARM x86 AMDGPU
NVPTX Javascript WASM
![Page 58: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/58.jpg)
Diverse Hardware backends
High-Level Differentiable IR
Tensor Expression IR
Optimization
AutoTVM
LLVM
ARM x86 AMDGPU
NVPTX Javascript WASM
CUDA
Metal
Vulkan
C
![Page 59: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/59.jpg)
Diverse Hardware backends
High-Level Differentiable IR
Tensor Expression IR
VTA
Optimization
AutoTVM
LLVM
ARM x86 AMDGPU
NVPTX Javascript WASM
CUDA
Metal
Vulkan
C
![Page 60: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/60.jpg)
TVM Open Source Community
![Page 61: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/61.jpg)
TVM Open Source Community
Apache governance model: grant project ownership by merit.
11 committers, 29 reviewers, 166 contributors.
Contributed by the community, for the community.
![Page 62: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/62.jpg)
Industrial Impact
![Page 63: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/63.jpg)
Vin Sharma, Amazon SageMaker Neo
Amazon: vinarm@ | Twitter: ciphr@
TVM + AWS
![Page 64: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/64.jpg)
How is AWS using TVM?
![Page 65: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/65.jpg)
How is AWS using TVM?
• As a back-end for Apache MXNet• To deploy easily onto edge devices
• To improve performance on target hardware
![Page 66: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/66.jpg)
How is AWS using TVM?
• As a back-end for Apache MXNet• To deploy easily onto edge devices
• To improve performance on target hardware
• As an optimizer for Amazon AI services• Amazon Rekognition: To improve end-to-end latency
• Amazon Alexa: To increase resource efficiency on Echo/Dot
![Page 67: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/67.jpg)
How is AWS using TVM?
• As a back-end for Apache MXNet• To deploy easily onto edge devices
• To improve performance on target hardware
• As an optimizer for Amazon AI services• Amazon Rekognition: To improve end-to-end latency
• Amazon Alexa: To increase resource efficiency on Echo/Dot
• In a tool chain for Amazon Inferentia
![Page 68: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/68.jpg)
How is AWS using TVM?
• As a back-end for Apache MXNet• To deploy easily onto edge devices
• To improve performance on target hardware
• As an optimizer for Amazon AI services• Amazon Rekognition: To improve end-to-end latency
• Amazon Alexa: To increase resource efficiency on Echo/Dot
• In a tool chain for Amazon Inferentia
We’re Hiring!
![Page 69: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/69.jpg)
How is AWS enabling adoption of TVM?In a new service called Amazon SageMaker Neo
![Page 70: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/70.jpg)
How is AWS enabling adoption of TVM?In a new service called Amazon SageMaker Neo
Model input files: MXNet: .json & .params
Framework
Output Location
Name and shape of input node: {“data”:[1,3,227,277]}
Target Platform: Cloud Instance Type | Edge Device
![Page 71: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/71.jpg)
How is AWS enabling adoption of TVM?In a new service called Amazon SageMaker Neo
Model input files: MXNet: .json & .params
Framework
Output Location
Name and shape of input node: {“data”:[1,3,227,277]}
Target Platform: Cloud Instance Type | Edge Device
We’re Hiring!
![Page 72: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/72.jpg)
How is AWS contributing to TVM?Releasing all TVM modifications and enhancements in Neo to open source• Frameworks: TensorFlow, MXNet, PyTorch, ONNX
• Models: ResNet, VGG, Inception, MobileNet, DenseNet, SqueezeNet
• Operators: Several new ops in NNVM/TVM
• Optimizations: Node Annotation, Graph Partitioning, Ring Buffer, NHWC, Graph Tuning
• Acceleration Library: Nvidia TensorRT
• Hardware: Cross-Compilation to ARM, Intel, Nvidia; More Coming Soon
![Page 73: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/73.jpg)
How is AWS contributing to TVM?Releasing all TVM modifications and enhancements in Neo to open source• Frameworks: TensorFlow, MXNet, PyTorch, ONNX
• Models: ResNet, VGG, Inception, MobileNet, DenseNet, SqueezeNet
• Operators: Several new ops in NNVM/TVM
• Optimizations: Node Annotation, Graph Partitioning, Ring Buffer, NHWC, Graph Tuning
• Acceleration Library: Nvidia TensorRT
• Hardware: Cross-Compilation to ARM, Intel, Nvidia; More Coming Soon
We’re Hiring!
![Page 74: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/74.jpg)
![Page 75: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/75.jpg)
Chen Tian, Technical VP
![Page 76: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/76.jpg)
Huawei Confidential
TVM on Huawei’s AI portfolio
CANN(Compute Architecture for Neural Networks)
Ascend
…
Ascend-MaxAscend-MiniAscend-Tiny
Ascend-Lite
Ascend-Nano
AI Applications
Consumer Device Public Cloud Private Cloud
Industrial IoT Device
Edge Computing
Application Enablement
Framework
Chip Enabler
IP & Chip
General APIs1
Advanced APIs
Pre-integrated SolutionsHiAI Service
HiAI Engine
ModelArts
MindSpore TensorFlow PyTorch PaddlePaddle
CCE lib/extensions Tensor Engine / TVM
Application enabling:Full-pipeline services(ModelArts), hierarchical APIs, and pre-integrated solutions
MindSpore: Unified training and inference framework for device, edge, and cloud (both standalone and cooperative)
CANN: Chip operators library and highly automated operators development toolkit
Ascend:AI chip series based on unified scalable architecture
![Page 77: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/77.jpg)
Huawei Confidential
Frameworks
model execution
TE/TVM
During model conversion we use TE/TVM to customize operators for completeness and performance.
Third-Party Operators
Model Conversion
How do we use TVM
70+ operators are written by TVM , bring us ~3x development efficiency improvement
![Page 78: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/78.jpg)
Huawei Confidential!32
Successful Practice with Audi in Level 4 Autonomous Driving~ A Complete City Commute Record ~
Driving in the evening
Traffic light identification Pedestrian identification
High-speed cruise Traffic Jam Pilot (TJP)
Automatic parking
Joint developed autonomous driving algorithm gains leading scores in industry authoritative KITTI 2D/3D/BEV tests!
![Page 79: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/79.jpg)
Huawei Confidential!33
Smart Manufacturing(intelligent quality inspection and flexible manufacturing)
Intelligent Care(kindergarten and elderly care)
Smart Transportation (traffic light tuning, intelligent traffic guiding)
Atlas 200 Developer Kit
● 16 TOPS INT8@24 W● 1 USB type-C, 2 CCM interfaces, 1
GE network port, 1 SD card slot● 8 GB memory
Atlas 300 AI Accelerator Card
● 64 TOPS INT8@75 W● 64-channel HD video real-time analysis
and JPEG decoding● 32 GB memory, 204.8 GB/s memory
bandwidth● PCIe 3.0 x16, half-height half-length card
Atlas 800 AI Appliance
• Provides optimized AI environment based on the standard framework and programming environment
• Leverages high-performance GPU scheduling algorithms, improving resource utilization by over 15%
• Capable of processing 16-channel HD videos in the size of a set-top-box (STB)
• Delivers 4x higher performance over counterparts
Atlas 500 AI Edge Station
TVM is working on Atlas series product
![Page 80: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/80.jpg)
HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential �34
Huawei’s Contributions on TVM8 Contributors:kun-zh, sgrechanik-h, libing4752, derisavi-huawei, solin319, ehsanmok, gaoxiong-1, jiacunjiang1215
4 Reviewers:Srkreddy1238 , PariksheetPinjari909 , siju-Samuel , Xqdan
We are working on:1.Huawei Ascend ASIC support.2.Front end to support Darknet, ONNX.3.Optimization on Auto-TVM, IR extensions.4.Tensorize, cache read/write, access_ptr API.
In the future we will try to:1.Codegen for fused operators.2.NLP support.3.More optimization.4.Training Operators.
![Page 81: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/81.jpg)
Meghan Cowan
![Page 82: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/82.jpg)
TensorflowLite32bit fp
66% top-1ImageNet accuracy1.42 fps
VGG11 on Raspberry Pi 3B
![Page 83: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/83.jpg)
TensorflowLite32bit fp
66% top-1ImageNet accuracy1.42 fps
Trained binarized model
Operators implemented with TVM
VGG11 on Raspberry Pi 3B
![Page 84: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/84.jpg)
TVM 2-bit activation 1-bit weight
62% top-1 ImageNet accuracy4.67 fps
TensorflowLite32bit fp
66% top-1ImageNet accuracy1.42 fps
Trained binarized model
Operators implemented with TVM
VGG11 on Raspberry Pi 3B
![Page 85: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/85.jpg)
Further down the stack…
![Page 86: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/86.jpg)
Thierry Moreau
![Page 87: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/87.jpg)
Open Source Stack Overview
High-Level Differentiable IR
Tensor Expression IR
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture VTA Simulator
Versatile Tensor Accelerator
Stack(VTA)
![Page 88: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/88.jpg)
Open Source Stack OverviewVTA Backends
• Simulator: out-of-the-box testing to write compiler passes
High-Level Differentiable IR
Tensor Expression IR
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture VTA Simulator
Versatile Tensor Accelerator
Stack(VTA)
![Page 89: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/89.jpg)
Open Source Stack OverviewVTA Backends
• Simulator: out-of-the-box testing to write compiler passes
• FPGA: fast design iteration, quick deployment, flexibility
High-Level Differentiable IR
Tensor Expression IR
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture VTA Simulator
Versatile Tensor Accelerator
Stack(VTA)
![Page 90: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/90.jpg)
Open Source Stack OverviewVTA Backends
• Simulator: out-of-the-box testing to write compiler passes
• FPGA: fast design iteration, quick deployment, flexibility
• ASIC: industrial-strength efficiency
High-Level Differentiable IR
Tensor Expression IR
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture VTA Simulator
Versatile Tensor Accelerator
Stack(VTA)
![Page 91: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/91.jpg)
Hardware Exploration with VTA
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
Deliverable
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
![Page 92: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/92.jpg)
Hardware Exploration with VTA
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
Deliverable
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
Deliverable
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
1000s
![Page 93: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/93.jpg)
Hardware Exploration with VTA
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
Deliverable
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
Deliverable
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
Deliverable
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
1000s ~10
![Page 94: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/94.jpg)
Schedule Exploration with VTA
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Deliverable
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
![Page 95: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/95.jpg)
Schedule Exploration with VTA
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Deliverable
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Deliverable
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
![Page 96: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/96.jpg)
Schedule Exploration with VTA
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Deliverable
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Deliverable
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
HW / SW Constraints
FPGA# BRAMs
DRAM channels
logic resources
Model batch size
data types
channel width
Architecture Knobs
GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
# of units in tensor ALU : e.g. 32 vs. 16
BRAM allocation between buffers, register file, micro-op cache
Circuit Knobs
Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333MHz
{VTA Design Space
}VTA Candidate Designs
#1 Design AAA @ 307GOPs
#2 Design BBB @ 307GOPs
#3 Design CCC @ 307GOPs
#4 Design DDD @ 256GOPs
Needs to pass place & routeand pass timing closure
307 GOPs
256 GOPs
thro
ughp
ut
autotuning steps
Operator Performance AutoTuning
Deliverable
Tuned Operator Lib
VTA Design BBB
FPGA
Graph Optimizer
Model
cust
om
![Page 97: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/97.jpg)
TVM+VTA Stack Goals
![Page 98: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/98.jpg)
TVM+VTA Stack Goals
• Blue-print for a complete deep learning acceleration stack
![Page 99: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/99.jpg)
TVM+VTA Stack Goals
• Blue-print for a complete deep learning acceleration stack
• Experimentation framework for cross-stack deep learning optimizations
![Page 100: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/100.jpg)
TVM+VTA Stack Goals
• Blue-print for a complete deep learning acceleration stack
• Experimentation framework for cross-stack deep learning optimizations
• Open-source community for industrial-strength deep learning acceleration
![Page 101: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/101.jpg)
Carlos Guestrin
![Page 102: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/102.jpg)
Training Deep Learning Models with TVM
![Page 103: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/103.jpg)
Jared Roesch
![Page 104: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/104.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Model
Standalone inference deployment
![Page 105: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/105.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Model
Standalone inference deployment
![Page 106: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/106.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Model
AutomaticDifferentiation
Gradient Program for Training
![Page 107: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/107.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Model
Standalone training deployment
AutomaticDifferentiation
Gradient Program for Training
![Page 108: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/108.jpg)
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC
Model
Standalone training deployment
• Automatic generation of gradient programs
• Support for customized data types and FPGA training
• Support for distributed execution, and integration with technology such as PHub (see Liang’s talk).
More details on the Relay talk later today!
AutomaticDifferentiation
Gradient Program for Training
![Page 109: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/109.jpg)
Road ahead…
![Page 110: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/110.jpg)
On the horizon…
Automation
Training
Hardware
![Page 111: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/111.jpg)
On the horizon…
Automation
Training
Hardware
AutoDiff with Relay
Training on-device
Tradeoff accuracy/throughput/Joules
![Page 112: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/112.jpg)
On the horizon…
Automation
Training
Hardware
AutoDiff with Relay
Training on-device
Tradeoff accuracy/throughput/Joules
Autoquantization
Full-program optimization
Automated HW design
![Page 113: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/113.jpg)
On the horizon…
Automation
Training
Hardware
AutoDiff with Relay
Training on-device
Tradeoff accuracy/throughput/Joules
Autoquantization
Full-program optimization
Automated HW design
VTA Chiseldesign
ASICflow
Training onVTA
![Page 114: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/114.jpg)
Big THANKS to our sponsors!
![Page 115: December 12, 2018Huawei Confidential TVM on Huawei’s AI portfolio CANN (Compute Architecture for Neural Networks) ... Unified training and inference framework for device, edge, and](https://reader034.fdocuments.us/reader034/viewer/2022042017/5e750c9787a07c6e0a4e527c/html5/thumbnails/115.jpg)
Keynote, TVM Overview,TVM @ Amazon9:00
Automation, HW Specialization, Security
Boxed lunches12:30
Training, Programming Systems, Hardware13:30
Break, contributors meetup15:20
Compilers, FPGAs15:50
Lightning talks16:30
Community formation17:35
18:10 Social (food, drinks)20:00 adjourn
Break11:05
11:25
Keynote (SAMPL, Apple, Amazon, Huawei)TVM Overview – Tianqi Chen, UWDeep Learning Compilation at Amazon – Yida Wang, Amazon
AutoTVM & Device Fleet – Eddie Yan, UWVTA Open Source Deep Learning Accelerator – Thierry Moreau, UWSecure Enclaves for Deep Learning – Nick Hynes, UC Berkeley/Oasis Labs
Kunle Olukotun/Raghu Prabhakar, Stanford & SambaNovaMachine Programming – Justin Gottschlich, IntelPlaidML Stripe: Polyhedral IR + Model-guided Optimization – Brian Retford, IntelThe Relay Differentiable IR for TVM – Jared Roesch, UWScalable Distributed Training with Parameter Hub – Liang Luo, UWThe HammerBlade ML Supercomputer – Michael Taylor, UW
Andrew Tulloch, FacebookGraham Schelle, Xilinx
Markus Weimer, Microsoft and Apache Software Foundation