AutoTVM & Device Fleet
Learning to Optimize Tensor Programs
High-level data flow graph and optimizations
Hardware
Frameworks
Machine Learning based Program Optimizer
Learning to generate optimized programs for new operator workloads and hardware
Search over Possible Program Transformations
Hardware
Loop Transformations Thread Bindings Cache Locality
Thread Cooperation Tensorization Latency Hiding
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
Compute Description
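To make one entry of the transformation list concrete, here is a minimal pure-Python sketch (not TVM code) of the compute description as a naive loop nest, next to a tiled variant of the same computation. The function names, the `tile` parameter and the test shapes are illustrative assumptions, not part of the slides.

```python
def matmul_naive(A, B, m, n, l):
    """C[y][x] = sum over k of A[k][y] * B[k][x], as in the compute description."""
    C = [[0] * n for _ in range(m)]
    for y in range(m):
        for x in range(n):
            for k in range(l):
                C[y][x] += A[k][y] * B[k][x]
    return C


def matmul_tiled(A, B, m, n, l, tile):
    """Same computation with the y and x loops split into tile-sized blocks."""
    C = [[0] * n for _ in range(m)]
    for yo in range(0, m, tile):                          # outer y loop over blocks
        for xo in range(0, n, tile):                      # outer x loop over blocks
            for k in range(l):                            # reduction moved outside the block
                for y in range(yo, min(yo + tile, m)):    # inner y loop within a block
                    for x in range(xo, min(xo + tile, n)):
                        C[y][x] += A[k][y] * B[k][x]
    return C


# Small integer inputs so that both loop orders give exactly equal results.
m, n, l = 6, 5, 4
A = [[k * m + y for y in range(m)] for k in range(l)]   # shape (l, m)
B = [[k - x for x in range(n)] for k in range(l)]       # shape (l, n)
assert matmul_naive(A, B, m, n, l) == matmul_tiled(A, B, m, n, l, tile=4)
```

Both functions compute the same C; only the loop structure differs, which is the kind of choice the search explores.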
Billions of possible optimization choices
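A rough sketch of why the space grows so quickly: if each tunable knob is chosen independently, the number of configurations is the product of the per-knob choices. The knob names and extents below are hypothetical and only illustrate the counting; they are not taken from a real schedule template.

```python
from math import prod


def split_choices(extent, parts):
    """Count ordered ways to factor `extent` into `parts` positive integers."""
    if parts == 1:
        return 1
    return sum(
        split_choices(extent // f, parts - 1)
        for f in range(1, extent + 1)
        if extent % f == 0
    )


# Hypothetical knobs for a single 512 x 512 x 512 operator.
knobs = {
    "tile_y": split_choices(512, 4),   # 4-way split of the y loop
    "tile_x": split_choices(512, 4),   # 4-way split of the x loop
    "tile_k": split_choices(512, 2),   # 2-way split of the reduction loop
    "unroll": 3,                       # e.g. three unroll settings
    "vectorize": 2,                    # on / off
}
total = prod(knobs.values())
print(total)
```

Every additional knob multiplies the total, so a handful of extra choices per operator is enough to make exhaustive measurement impractical.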
Learning-based Program Optimizer
Program Optimizer, Program, Code Generator
Runtime Measurements
High experiment cost: each trial costs ~1 second
Cost Model
Need reliable cost model per hardware
D
Training data
Learning
Statistical Cost Model
Unique Problem Characteristics
• Relatively low experiment cost
• Domain-specific problem structure
• Large quantity of similar tasks
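The optimizer loop on this slide (propose programs, measure them, add the results to the training data D, refit the statistical cost model) can be sketched as follows. This is a toy under stated assumptions: `measure` is a stand-in for a real hardware run, the nearest-neighbour `predict` is a placeholder for the learned model, and all names are hypothetical.

```python
import random


def measure(cfg):
    """Stand-in for a hardware trial; a real one would compile and time the program."""
    ty, tx = cfg
    return abs(ty - 16) + abs(tx - 8) + 1.0   # pretend (16, 8) is the fastest tiling


def predict(cfg, data):
    """Toy statistical cost model: reuse the cost of the nearest measured config."""
    if not data:
        return 0.0   # nothing learned yet, so every candidate looks the same
    nearest = min(data, key=lambda d: abs(d[0][0] - cfg[0]) + abs(d[0][1] - cfg[1]))
    return nearest[1]


def tune(space, n_rounds, batch, seed=0):
    rng = random.Random(seed)
    data = []     # training data D: (config, measured cost) pairs
    seen = set()
    for _ in range(n_rounds):
        candidates = [c for c in space if c not in seen]
        if not candidates:
            break
        rng.shuffle(candidates)                           # random tie-breaking
        candidates.sort(key=lambda c: predict(c, data))   # rank with the cost model
        for cfg in candidates[:batch]:                    # measure only the top few
            data.append((cfg, measure(cfg)))
            seen.add(cfg)
    best = min(data, key=lambda d: d[1])
    return best, data


space = [(ty, tx) for ty in (1, 2, 4, 8, 16, 32) for tx in (1, 2, 4, 8, 16, 32)]
best, data = tune(space, n_rounds=5, batch=4)
print("best config:", best[0], "cost:", best[1], "trials:", len(data))
```

The point of the design is that the expensive step (`measure`) is called only on candidates the cheap model ranks highly.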
Program-aware Cost Modeling
High-Level Configuration
for y in range(8):
  for x in range(8):
    C[y][x] = 0
    for k in range(8):
      C[y][x] += A[k][y] * B[k][x]
Low-level Abstract Syntax Tree (shared between tasks)
touched memory
     C    A    B
y    64   64   64
x    8    8    64
k    1    8    8
outer loop length
y    1
x    8
k    64
statistical features
Boosted Tree Ensembles
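The touched-memory and outer-loop-length numbers for the 8 x 8 x 8 loop nest can be reproduced with a short sketch. The data layout (`loops`, `accesses`) and the function name are assumptions made for illustration; only the resulting numbers come from the slide.

```python
from math import prod

# Loop nest from the slide, outermost to innermost, each with extent 8.
loops = [("y", 8), ("x", 8), ("k", 8)]
# Index variables used by each buffer access: C[y][x], A[k][y], B[k][x].
accesses = {"C": ("y", "x"), "A": ("k", "y"), "B": ("k", "x")}


def loop_features(loops, accesses):
    feats = {}
    for depth, (var, _) in enumerate(loops):
        # Extents of this loop and every loop nested inside it.
        inner = dict(loops[depth:])
        # Elements of each buffer touched by one full execution of this loop;
        # index variables bound by outer loops are fixed, so they contribute 1.
        touched = {
            buf: prod(inner.get(ix, 1) for ix in idx)
            for buf, idx in accesses.items()
        }
        # How many times this loop is entered: product of the outer extents.
        outer_len = prod(ext for _, ext in loops[:depth])
        feats[var] = {"touched": touched, "outer_loop_length": outer_len}
    return feats


for var, f in loop_features(loops, accesses).items():
    print(var, f)
```

Vectors like these, one per loop level, are the kind of statistical features a boosted tree ensemble can consume.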
TreeGRU
for loops: context vec of y, context vec of x, context vec of k
soft scatter, +, final embedding
Effectiveness of ML based Model
[Figure: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials (0 to 800). y-axis: Relative Speedup (0.00 to 1.50). Curves: Baseline: CuDNN; TVM: Random Search; TVM: ML-based Model]
Transfer Learning Among Different Workloads
Historical Optimization Tasks
Domain Invariant Program Representations
Transferable Models to speed up new tasks
NVIDIA GPU Optimization (GTX 1080 Ti)
[Figure: bar chart. y-axis: Latency (ms), 0 to 7. Networks: ResNet-50, MobileNet, VGG-19, Inception V3, DenseNet-121. Series: MXNet + TensorRT 4.0, AutoTVM]
AMD GPU Optimization (Vega FE)
[Figure: bar chart. y-axis: Latency (ms), 0 to 7. Networks: ResNet-50, MobileNet, DenseNet-121. Series: MIOpen, AutoTVM]
Bonus (INT8, GTX 1080)
[Figure: bar chart. y-axis: Latency (ms), 0E+00 to 1.6E-04. Workloads: 1-7-512-512-1, 4-7-512-512-1, 1-7-512-512-3, 4-7-512-512-3, 1-14-256-256-1, 4-14-256-256-1. Series: cuDNN, AutoTVM]
High Level: Scaling Automatic Performance Profiling
Fleet Tracker
Low Level: Portable RPC Tracker + Server
Resource Manager (Tracker)
Nvidia GPU Server: RPC RT, CUDA tasks
Android Phone: RPC RT, OpenCL tasks
AMD GPU Server: RPC RT, ROCm tasks
Zynq FPGA board: RPC RT, JIT driver
Raspberry Pi: RPC RT, ARM tasks
Shared cluster of heterogeneous devices
Optimization Service
RPC client
Resource token
Resource Allocation
RPC Session Data Path
cross compiler
Red modules can be reconfigured remotely in each session
Running optimization services
Prioritizer
Workload 1
Workload 2
Workload 3
ML-based cost model
…
Hardware bitstream
RPC Communication Flow
Client, Tracker: request device, return handle
Client, Device: upload code, run code, return run time
Device: device free
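The message sequence in this flow can be mimicked in-process with a toy tracker and device. This is a sketch under stated assumptions, not the TVM RPC API: the class names, method names and the "rasp3b" device key are hypothetical, and no networking is involved.

```python
import time
from collections import deque


class Tracker:
    """Toy resource manager: keeps a queue of idle devices per device key."""

    def __init__(self):
        self.free = {}   # key -> deque of idle devices

    def device_free(self, device):
        # A device reports that it is idle and can be handed out again.
        self.free.setdefault(device.key, deque()).append(device)

    def request_device(self, key):
        # Return a handle to an idle device, or None if all are busy.
        queue = self.free.get(key)
        if not queue:
            return None   # a real tracker would make the client wait
        return queue.popleft()


class Device:
    def __init__(self, key, name):
        self.key, self.name = key, name
        self.code = None

    def upload_code(self, fn):
        self.code = fn

    def run_code(self, *args):
        start = time.perf_counter()
        self.code(*args)
        return time.perf_counter() - start   # run time reported to the client


tracker = Tracker()
dev = Device("rasp3b", "pi-0")
tracker.device_free(dev)                        # device free

handle = tracker.request_device("rasp3b")       # request device, return handle
handle.upload_code(lambda n: sum(range(n)))     # upload code
elapsed = handle.run_code(100_000)              # run code, return run time
tracker.device_free(handle)                     # device free again
print(f"{handle.name} ran the task in {elapsed:.6f} s")
```

Keeping allocation in the tracker and execution on the device means the client never needs to know which physical board served a given trial.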
Model to Tuned Implementation
Model -> operator extraction -> Bag of Operators -> AutoTVM tuning -> Tuned Model
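A minimal sketch of the operator-extraction step: walk the model's layers, keep the tunable ones, and deduplicate identical workloads so each is tuned once. The layer encoding, the `TUNABLE` set and the `tune_fn` callback are hypothetical; a real pipeline would extract tasks from the compiler's graph representation.

```python
from collections import Counter

# Hypothetical flattened model: (operator name, workload parameters) per layer.
model = [
    ("conv2d", (64, 64, 56, 3)),
    ("relu", (64, 56)),
    ("conv2d", (64, 64, 56, 3)),    # same workload as the first layer
    ("conv2d", (64, 128, 28, 3)),
    ("dense", (512, 1000)),
]

TUNABLE = {"conv2d", "dense"}       # assumed set of operators worth tuning


def extract_operators(layers, tunable=TUNABLE):
    """Bag of operators: unique tunable workloads with their occurrence counts."""
    return Counter(layer for layer in layers if layer[0] in tunable)


def tune_model(layers, tune_fn):
    """Tune each unique workload once and return a workload -> config table."""
    bag = extract_operators(layers)
    return {workload: tune_fn(workload) for workload in bag}


bag = extract_operators(model)
tuned = tune_model(model, tune_fn=lambda w: f"best-config-for-{w[0]}")
print(len(model), "layers,", len(bag), "unique tuning tasks")
```

The occurrence counts could also be used to prioritize workloads that appear most often in the model.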
Next: Autoscheduler, Lianmin @ 16:30
conv2d, x86
conv2d, GPU, winograd
conv2d, ARM, spatial packing
Handcrafted Schedule Templates