Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning
Cedric Nugteren
GPU Technology Conference, April 7, 2016
HPC center of the Netherlands
How to find the best flight ticket?
2
Which airline?
Fly a day earlier or later?
Which route?
Solution:
• Evaluate all combinations manually
• … or use a flight comparison website
How to find the best flight ticket?

3

Which airline? → Cache in shared memory?
Fly a day earlier or later? → Thread block sizes?
Which route? → More work per thread or less?

Solution:
• Evaluate all combinations manually
• … or use a flight comparison website → the CLTune auto-tuner
Example: convolution
4
Targets:
• GPUs (CUDA & OpenCL)
• Multi-core CPUs
• Other OpenCL-capable devices
Example: blur filter

[Figure: an X-by-Y input A is convolved with Xf-by-Yf filter coefficients (e.g. a 3-by-3 filter) to produce the output B]
OpenCL 2D convolution
• double for-loop over the filter
• each thread computes one output pixel
Thread coarsening (2D)?
Unroll loops?
Work-group / thread-block size?
Vector data-types?
Cache in local / shared memory?
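As a point of reference, the per-thread computation can be sketched on the CPU as follows. This is a minimal, untuned sketch; the function and variable names are illustrative, not the talk's actual kernel:

```cpp
#include <cassert>
#include <vector>

// Untuned reference for the 2D convolution: each "thread" (ox, oy)
// computes one output pixel via a double for-loop over the filter.
std::vector<float> conv2d(const std::vector<float>& in, int X, int Y,
                          const std::vector<float>& coeff, int Xf, int Yf) {
  const int outX = X - Xf + 1, outY = Y - Yf + 1;
  std::vector<float> out(outX * outY, 0.0f);
  for (int oy = 0; oy < outY; ++oy) {      // one (ox, oy) pair = one work-item
    for (int ox = 0; ox < outX; ++ox) {
      float acc = 0.0f;
      for (int fy = 0; fy < Yf; ++fy)      // the double for-loop
        for (int fx = 0; fx < Xf; ++fx)
          acc += coeff[fy * Xf + fx] * in[(oy + fy) * X + (ox + fx)];
      out[oy * outX + ox] = acc;
    }
  }
  return out;
}
```

Thread coarsening, loop unrolling, vector data-types and local-memory caching are all transformations of this loop nest, which is exactly what makes the search space so large.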
5
Search-space explosion

16 x 2 x 16 x 4 x 5 = 10240 combinations
→ filter illegal configurations → 3424 configurations

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
6
Why do we need an auto-tuner?

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Wide variety of devices:
• Different optimal kernels
• Even from the same vendor

User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.
8
CLTune in action

Example input data
Verifies output based on an optional reference kernel
9
Example: matrix-vector multiplication (y = Ax) with tiling
Tile size (TS): 4 possible values
Kernel to tune
CLTune in action
10
Tuneable tile-size TS
Handles all the host-device data transfers!
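The tiling idea behind this kernel can be sketched on the CPU: the inner dimension is processed in chunks of TS elements, and on the GPU each chunk of x would be staged in local/shared memory once per work-group instead of being re-read by every thread. An illustrative sketch, not the talk's actual kernel:

```cpp
#include <cassert>
#include <vector>

// y = A*x with the inner dimension processed in tiles of size TS.
// On the GPU, each tile of x would be loaded into local memory once
// per work-group; TS is the parameter CLTune would sweep over.
template <int TS>
std::vector<float> matvec_tiled(const std::vector<float>& A,
                                const std::vector<float>& x,
                                int M, int N) {
  std::vector<float> y(M, 0.0f);
  for (int m = 0; m < M; ++m) {          // one "thread" per output element
    float acc = 0.0f;
    for (int t = 0; t < N; t += TS)      // loop over tiles of x
      for (int k = 0; k < TS && t + k < N; ++k)
        acc += A[m * N + t + k] * x[t + k];
    y[m] = acc;
  }
  return y;
}
```

The tuner compiles one kernel per candidate TS value, runs each on the device, and keeps the fastest.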
Search strategies

Option 0: Full search
☺ Finds optimal solution
☓ Explores all options

[Plot: rotated histogram of the full search space on the K40m, with the mean marked; performance as % of best-known]
12
3424 convolution kernels on Tesla K40m GPU
Search strategies

Option 0: Full search
Option 1: Random search
☺ Explores arbitrary fraction
☓ Performance varies

[Plot: random search on the K40m; performance (% of best-known) vs. search progress (steps); colours show 3 example runs, ending at 56%, 74%, and 94%]

Example: 107 out of 3424 configurations (1/32nd)
13
Search strategies

Option 0: Full search
Option 1: Random search
Option 2: Simulated annealing
☺ Explores arbitrary fraction
☓ Performance varies
☓ Meta-parameter
☓ Local optima

[Plot: SA with T=4 on the K40m; performance (% of best-known) vs. search progress (steps); colours show 3 example runs, ending at 64%, 84%, and 91%]

Example: 107 out of 3424 configurations (1/32nd)
14
Search strategies

Option 0: Full search
Option 1: Random search
Option 2: Simulated annealing
Option 3: Particle swarm optimisation
☺ Explores arbitrary fraction
☓ Performance varies
☓ Meta-parameters
☓ Local optima

[Plot: PSO with S=3 on the K40m; performance (% of best-known) vs. search progress (steps); colours show 3 example runs, ending at 43%, 44%, and 66%; line-types show the 3 swarms]

Example: 107 out of 3424 configurations (1/32nd)

15
Search strategies evaluation

[Plot: rotated histogram of the search space on the K40m; markers show the average best result of 128 searches for random, SA (T=2, 4, 6), and PSO (S=3, 6), where T and S are the meta-parameters of SA and PSO; performance as % of best-known]

Each search: 107 out of 3424 configurations (1/32nd)
16
Conclusions:
• PSO performs poorly
• SA performs well, but not much better than random search
[1]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.
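To make the comparison concrete, here is a minimal sketch of random search and simulated annealing over a mock one-dimensional search space. The cost function, neighbourhood, and cooling schedule are invented for illustration; real timings come from running candidate kernels on the device:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Mock "kernel time" over a search space of configurations.
// Invented for illustration only.
double mock_time(int i) {
  return 2.0 + std::sin(i * 0.05) + 0.5 * std::sin(i * 0.71);
}

// Option 1: random search; sample a fixed budget of configurations.
int random_search(int space, int budget, std::mt19937& rng) {
  std::uniform_int_distribution<int> pick(0, space - 1);
  int best = pick(rng);
  for (int s = 1; s < budget; ++s) {
    int c = pick(rng);
    if (mock_time(c) < mock_time(best)) best = c;
  }
  return best;
}

// Option 2: simulated annealing; random neighbour moves, accepting worse
// configurations with a probability that decays as the temperature T cools.
int annealing_search(int space, int budget, double T, std::mt19937& rng) {
  std::uniform_int_distribution<int> pick(0, space - 1);
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  int current = pick(rng), best = current;
  for (int s = 1; s < budget; ++s) {
    int next = (current + 1 + pick(rng) % 16) % space;  // nearby neighbour
    double delta = mock_time(next) - mock_time(current);
    double temp = T * (1.0 - double(s) / budget);       // linear cooling
    if (delta < 0.0 || coin(rng) < std::exp(-delta / std::max(temp, 1e-9)))
      current = next;
    if (mock_time(current) < mock_time(best)) best = current;
  }
  return best;
}
```

With a 1/32 budget (107 of 3424) both typically land near, but not at, the optimum, mirroring the observation above that SA is not much better than random search.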
Search strategies evaluation

[Plots: as above, for four devices: K40m, GTX480, HD7970, and Iris; each shows a rotated histogram of its search space and the results of random, SA (T=2, 4, 6), and PSO (S=3, 6) as % of best-known]
17
Solution: machine learning?
18
Machine learning an auto-tuner
19
Evaluate a subset of all configurations (e.g. 100 out of 3424)
→ Train a model with the examples (input: parameter configuration, output: execution time)
→ Predict the execution time for all other configurations
→ Take the best 10 predictions and evaluate them on actual hardware

Models: linear regression, and a 3-layer fully-connected neural network

[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.
Machine learning an auto-tuner
20
Trained on a random subset of the convolution example (1/32nd)
Zoomed in
Machine learning an auto-tuner
21
Zoomed in
Trained on a random subset of the convolution example (1/32nd)
GEMM case-study

[Bar chart: GEMM performance as % of device peak on K40m, GTX480, HD7970, and Iris, comparing this work, clBLAS, and cuBLAS (cuBLAS N/A on the two non-NVIDIA devices); absolute GFLOPS are annotated per bar]

[Diagram: tiling of C += A*B: per-workgroup tiles of Mwg x Nwg elements stepping over K in chunks of Kwg, per-thread tiles of Mwi x Nwi elements, with the inner loop unrolled by a factor Kwi; Mwi x MdimC = Mwg]
Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS
22
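The two-level tiling hierarchy in the diagram can be sketched on the CPU. This is a simplified square-tile version with invented tile sizes; the real kernel has many more parameters (local-memory staging, vector widths, strides, etc.):

```cpp
#include <cassert>
#include <vector>

// C += A*B with two levels of tiling: "workgroup" tiles of WG x WG
// elements over M and N, stepped over K in chunks of KWG, and
// "thread" tiles of WI x WI inside each workgroup tile.
// Assumes M and N divisible by WG, K by KWG, and WG by WI.
template <int WG, int KWG, int WI>
void gemm_tiled(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int M, int N, int K) {
  for (int mw = 0; mw < M; mw += WG)
    for (int nw = 0; nw < N; nw += WG)
      for (int kw = 0; kw < K; kw += KWG)          // per-workgroup K loop
        for (int mi = mw; mi < mw + WG; mi += WI)  // per-thread tiles
          for (int ni = nw; ni < nw + WG; ni += WI)
            for (int m = mi; m < mi + WI; ++m)
              for (int n = ni; n < ni + WI; ++n)
                for (int k = kw; k < kw + KWG; ++k)
                  C[m * N + n] += A[m * K + k] * B[k * N + n];
}
```

On a GPU, the workgroup tile maps to one work-group (its A and B slices staged in local memory) and the thread tile maps to the registers of one work-item; the auto-tuner's job is to pick WG, KWG, and WI (Mwg/Nwg/Kwg and Mwi/Nwi/Kwi in the diagram) per device.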
Convolution case-study

[Bar chart: convolution performance on the GTX480 as % of device peak for 3x3, 7x7, and 11x11 filters: 298, 572, and 787 GFLOPS (125, 46, and 26 GB/s respectively)]
Conclusions:
• Different best parameters for different devices (see paper) and filter-sizes
• Performance equal to or better than the state-of-the-art [3]
[3]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.
24
What about CUDA support?
25
Written for OpenCL, but …
• Uses the high-level OpenCL API CLCudaAPI
• And CLCudaAPI also has a CUDA back-end!
• Switch between OpenCL and CUDA with a single include
• Uses the CUDA 7.0 driver API and the runtime compilation library (nvrtc)
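The single-include switch can be sketched as below. The header names follow the CLCudaAPI repository; treat this as an illustrative fragment, not a complete program:

```cpp
// Select the CLCudaAPI back-end with a single include; both headers
// expose the same high-level classes (Platform, Device, Context,
// Queue, Program, Kernel, ...), so the host code stays unchanged.
#if defined(USE_OPENCL)
  #include "clpp11.h"  // OpenCL back-end
#else
  #include "cupp11.h"  // CUDA back-end: CUDA driver API + nvrtc
#endif
```

Because the CUDA back-end compiles kernels at run-time through nvrtc, the same kernel-string-based tuning flow works on both back-ends.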
Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning

Auto-tuning GPU kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent

Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation

Case-studies:
• Fastest 2D convolution
• CLBlast matrix-multiplication library

[Bar chart: convolution performance on the GTX480, repeated from the case-study]

Machine-learning:
• Train a model on a small subset
• Use the model to predict the remainder
Source-code on GitHub: https://github.com/CNugteren/CLTune
26
[More info]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.
Slides available @ www.cedricnugteren.nl