Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU...

26
Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC center of the Netherlands Augmented Reality Technology

Transcript of Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU...

Page 1: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Better Than All the Rest:Finding Maximum-Performance GPU Kernels

Using Auto-Tuning

Cedric Nugteren

GPU Technology ConferenceApril 7, 2016

HPC center of the Netherlands Augmented Reality Technology

Page 2: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

How to find the best flight ticket?

2

Which airline?

Fly a day earlier or later?

Which route?

Solution:• Evaluate all combinations manually • … or use a flight comparison website

Page 3: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

How to find the best flight ticket?

3

Which airline?

Fly a day earlier or later?

Which route?

Solution:• Evaluate all combinations manually • … or use a flight comparison websitethe CLTune auto-tuner

Cache in shared memory?

Thread block sizes?

More work per thread or less?

Page 4: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Example: convolution

4

Targets:• GPUs (CUDA & OpenCL) • Multi-core CPUs • Other OpenCL-capable devices

Example: blur filter

X

Y Yf

Xf A

X

B

filter size: Xf by Yf

outputinputinput output

filter coefficients

example: 3 by 3 filter

Page 5: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

OpenCL 2D convolution

double for-loop

each thread: one output pixel

Thread coarsening (2D)?

Unroll loops?

work-group / thread-block size?

Vector data-types?

Cache in local / shared memory?

5

Page 6: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search-space explosion

16 x 2 x 16 x 4 x 5 = 10240

3424 configurations

filter illegalconfigurationsLarge search-space:

• Not feasible to explore manually • Perhaps not even feasible automatically?

Thread coarsening (2D)?

Unroll loops?

Work-group / thread-block size?

Vector data-types?

Cache in local / shared memory?

6

Page 7: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Why do we need an auto-tuner?Large search-space:• Not feasible to explore manually • Perhaps not even feasible automatically?

Wide variety of devices:• Different optimal kernels • Even from the same vendor

User-parameter dependent:• Examples: matrix sizes,

image size, filter sizes, etc.

Vector data-types?

Thread coarsening (2D)?

Unroll loops?

Work-group / thread-block size?

Cache in local / shared memory?

X

Y Yf

Xf A

X

B

7

Page 8: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Why do we need an auto-tuner?Large search-space:• Not feasible to explore manually • Perhaps not even feasible automatically?

Wide variety of devices:• Different optimal kernels • Even from the same vendor

User-parameter dependent:• Examples: matrix sizes,

image size, filter sizes, etc.

Vector data-types?

Thread coarsening (2D)?

Unroll loops?

Work-group / thread-block size?

Cache in local / shared memory?

X

Y Yf

Xf A

X

B

8

Page 9: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

CLTune in actionExample input data

Verifies output based on optional reference kernel

9

Example: matrix-vector multiplication (y = Ax)

with tiling

Tile size (TS):4 possible values

Kernel to tune

Page 10: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

CLTune in action

10

Tuneable tile-size TS

Handles all the host - device data transfers!

Page 11: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

CLTune in action

11

Tuneable tile-size TS

Handles all the host - device data transfers!

Page 12: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search strategiesOption 0: Full search☺ Finds optimal solution ☓Explores all options

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: K40m

● ●

searchspace

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: K40m

● ●

searchspace

mean

rotated histogram

12

3424 convolution kernels on Tesla K40m GPU

Page 13: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search strategiesOption 0: Full searchOption 1: Random search☺ Explores arbitrary fraction ☓ Performance varies

Colours: 3 example runssearch progress (steps)

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

0 20 40 60 80 100

random on K40m

56%

94%

74%

Example: 107 out of 3424 configurations (1/32th)

13

Page 14: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search strategiesOption 0: Full searchOption 1: Random searchOption 2: Simulated annealing☺ Explores arbitrary fraction ☓ Performance varies ☓ Meta-parameter ☓ Local optima

search progress (steps)

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

0 20 40 60 80 100

SA T=4 on K40m91%

64%

84%

Example: 107 out of 3424 configurations (1/32th)

Colours: 3 example runs

14

Page 15: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search strategiesOption 0: Full searchOption 1: Random searchOption 2: Simulated annealingOption 3: Particle swarm optim.☺ Explores arbitrary fraction ☓ Performance varies ☓ Meta-parameter ☓ Local optima

Colours: 3 example runssearch progress (steps)

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

0 20 40 60 80 100

PSO S=3 on K40m

43%

66%

44%

Example: 107 out of 3424 configurations (1/32th)

Line-types: 3 swarms15

Page 16: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search strategies evaluationpe

rform

ance

[% o

f bes

t−kn

own]

0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: K40m

● ●

searchspace

Each search: 107 out of 3424 configurations (1/32th)

average best result of 128 searches

meta-parameters for SA and PSO

16

Conclusions:• PSO performs poorly • SA perform good, but not much better

than random search

[1]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.

Page 17: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Search strategies evaluationpe

rform

ance

[% o

f bes

t−kn

own]

0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: K40m

● ●

searchspace

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: GTX480

●● ● ●

●●

searchspace

perfo

rman

ce [%

of b

est−

know

n]

0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: HD7970

● ● ●● ●

searchspace

perfo

rman

ce [%

of b

est−

know

n]0

20

40

60

80

100

random SA T=2 SA T=4 SA T=6 PSO S=3 PSO S=6

device: Iris

● ● ● ●

● ●

searchspace

17

Page 18: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Solution: machine learning?

18

Page 19: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Machine learning an auto-tuner

19

Evaluate subset of all configurations

(e.g. 100 out of 3424)Train model with the examples

Predict execution time for all other configurations

Neural network (3-layer fully connected)

Linear regression

(take best 10 and evaluate them on actual hardware)

[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.

input: parameter configuration

output: execution time

Page 20: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Machine learning an auto-tuner

20

Trained on a random subset of convolution example (1/32th)

Zoomed in

Page 21: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Machine learning an auto-tuner

21

Zoomed in

Trained on a random subset of convolution example (1/32th)

Page 22: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

GEMM case-study

perfo

rman

ce [%

of d

evic

e−pe

ak]

0

20

40

60

80

100

K40m GTX480 HD7970 Iris

GEMM performance comparison this workclBLAScuBLAS

1933

1197

2870 845

318

8823144

2409

240

56

GFLOPS

GFLOPS

GFLOPSGFLOPS

GFLOPS

GFLOPSGFLOPS

GFLOPS

GFLOPS

GFLOPS

cuBLASN/A

cuBLASN/A

X

C

+=

Mwg

M

Mwi

KwgK

Nwg

N

Nwi

N

Mwg

Nwg

Nwi

M

Mwi

Per thread tiling

Per workgroup tiling

BAT

Unrolled by a factor Kwi

Mwi x MdimC = Mwg

Conclusions:• Different best parameters for

different devices • Performance better than clBLAS,

but not as good as assembly-tuned cuBLAS

22

Page 23: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

GEMM case-study

perfo

rman

ce [%

of d

evic

e−pe

ak]

0

20

40

60

80

100

K40m GTX480 HD7970 Iris

GEMM performance comparison this workclBLAScuBLAS

1933

1197

2870 845

318

8823144

2409

240

56

GFLOPS

GFLOPS

GFLOPSGFLOPS

GFLOPS

GFLOPSGFLOPS

GFLOPS

GFLOPS

GFLOPS

cuBLASN/A

cuBLASN/A

X

C

+=

Mwg

M

Mwi

KwgK

Nwg

N

Nwi

N

Mwg

Nwg

Nwi

M

Mwi

Per thread tiling

Per workgroup tiling

BAT

Unrolled by a factor Kwi

Mwi x MdimC = Mwg

Conclusions:• Different best parameters for

different devices • Performance better than clBLAS,

but not as good as assembly-tuned cuBLAS

23

Page 24: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Convolution case-studype

rform

ance

[% o

f dev

ice−

peak

]

0

20

40

60

80

100

3x3 filter 7x7 filter 11x11 filter

device: GTX480

298

125

572

46

787

26GFLOPS

GB/s

GFLOPS

GB/s

GFLOPS

GB/s

Conclusions:• Different best parameters for different:

• devices (see paper) • filter-sizes

• Performance equal or better than the state-of-the-art [3]

[3]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.

24

Page 25: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

What about CUDA support?

25

Written for OpenCL, but …• Uses the high-level OpenCL API CLCudaAPI

• And CLCudaAPI also has a CUDA back-end!

• Switch between OpenCL and CUDA with a single include • Uses the CUDA 7.0 driver API and runtime compilation library (nvrtc)

Page 26: Better Than All the Rest - GitHub PagesBetter Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning Cedric Nugteren GPU Technology Conference April 7, 2016 HPC

Better Than All the Rest:Finding Maximum-Performance GPU Kernels Using Auto-Tuning

Auto-tuning GPU kernels:• Large search-space • Wide variety of devices • User-parameter dependent Advanced search strategies:• Simulated annealing • Particle swarm optimisation

Case-studies:• Fastest 2D convolution • CLBlast matrix-multiplication

library perfo

rman

ce [%

of d

evic

e−pe

ak]

0

20

40

60

80

100

3x3 filter 7x7 filter 11x11 filter

device: GTX480

298

125

572

46

787

26GFLOPS

GB/s

GFLOPS

GB/s

GFLOPS

GB/s

Machine-learning:• Train a model on a small subset • Use the model to predict the remainder

Source-code on GitHub: https://github.com/CNugteren/CLTune

26

[More info]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.

Slides available @ www.cedricnugteren.nl