
Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning

Cedric Nugteren

GPU Technology Conference, April 7, 2016

HPC center of the Netherlands / Augmented Reality Technology

How to find the best flight ticket?

• Which airline?
• Fly a day earlier or later?
• Which route?

Solution:
• Evaluate all combinations manually
• … or use a flight comparison website

How to find the best flight ticket?

Solution:
• Evaluate all combinations manually
• … or use a flight comparison website: the CLTune auto-tuner

For a GPU kernel, the questions are analogous:
• Cache in shared memory?
• Thread-block sizes?
• More work per thread or less?

Example: convolution

Targets:
• GPUs (CUDA & OpenCL)
• Multi-core CPUs
• Other OpenCL-capable devices

Example: blur filter

[Figure: a 2D convolution reads input A and writes output B; the filter has size Xf by Yf and a set of filter coefficients. Example: a 3 by 3 filter.]

OpenCL 2D convolution:
• A double for-loop over the filter coefficients
• Each thread computes one output pixel

Search-space explosion

• Thread coarsening (2D)?
• Unroll loops?
• Work-group / thread-block size?
• Vector data-types?
• Cache in local / shared memory?

16 x 2 x 16 x 4 x 5 = 10240 combinations; after filtering out illegal configurations, 3424 remain.

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
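The arithmetic is easy to reproduce. A minimal Python sketch, with hypothetical parameter names and a hypothetical work-group limit (the slide only gives the value counts), showing both the raw product and the kind of constraint that filters out illegal configurations:

```python
import itertools
from functools import reduce

# Number of values per tunable parameter as counted on the slide
# (16 x 2 x 16 x 4 x 5 = 10240); which count belongs to which parameter
# is an assumption here.
value_counts = [16, 2, 16, 4, 5]
print(reduce(lambda a, b: a * b, value_counts))  # -> 10240

# Illegal configurations must be filtered out before tuning. A toy example
# with hypothetical work-group parameters and a 1024-thread device limit:
params = {"wg_x": [8, 16, 32, 64], "wg_y": [8, 16, 32, 64]}
legal = [
    (x, y)
    for x, y in itertools.product(params["wg_x"], params["wg_y"])
    if x * y <= 1024  # device limit on threads per work-group
]
print(len(legal))  # 16 raw combinations, 3 of which are illegal -> 13
```

For the real convolution kernel the same kind of filtering prunes 10240 raw combinations down to 3424 legal ones.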

Why do we need an auto-tuner?

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Wide variety of devices:
• Different optimal kernels
• Even from the same vendor

User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.


CLTune in action

• Takes example input data
• Verifies output based on an optional reference kernel
• Handles all the host-device data transfers

Example: matrix-vector multiplication (y = Ax) with tiling; the kernel to tune has a tuneable tile-size TS with 4 possible values.

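CLTune's host code is C++; the control flow it automates for this example can be sketched in Python. Everything below (the toy tiled matvec, the tile-size values, the timing loop) is illustrative and is not CLTune's API:

```python
import time

def matvec_reference(A, x):
    """Reference result used to verify each tuned variant."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_tiled(A, x, ts):
    """Toy stand-in for the tiled y = Ax kernel; ts is the tile size TS."""
    n = len(x)
    y = [0.0] * len(A)
    for i, row in enumerate(A):
        for t in range(0, n, ts):  # loop over tiles of size TS
            y[i] += sum(row[j] * x[j] for j in range(t, min(t + ts, n)))
    return y

def tune(A, x, tile_sizes):
    """Run every configuration, verify it, and keep the fastest."""
    reference = matvec_reference(A, x)
    best = None
    for ts in tile_sizes:
        start = time.perf_counter()
        y = matvec_tiled(A, x, ts)
        elapsed = time.perf_counter() - start
        # Verification against the reference kernel, as CLTune does:
        assert all(abs(a - b) < 1e-6 for a, b in zip(y, reference))
        if best is None or elapsed < best[1]:
            best = (ts, elapsed)
    return best

A = [[float(i + j) for j in range(64)] for i in range(64)]
x = [1.0] * 64
ts, _ = tune(A, x, tile_sizes=[4, 8, 16, 32])  # the 4 values of TS
```

The real tuner additionally compiles one kernel binary per configuration and moves the buffers to and from the device before timing.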

Search strategies

Option 0: Full search
☺ Finds the optimal solution
☓ Explores all options

[Figure: rotated histograms of performance (% of best-known) over the full search space, with means marked, for random search, SA (T=2, 4, 6), and PSO (S=3, 6). 3424 convolution kernels on a Tesla K40m GPU.]

Search strategies

Option 1: Random search
☺ Explores an arbitrary fraction
☓ Performance varies

[Figure: search progress (steps) vs. performance (% of best-known) for random search on the K40m; colours show 3 example runs, ending at 56%, 74%, and 94%.]

Example: 107 out of 3424 configurations (1/32th)
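Random search is simple to state. A minimal sketch with a toy search space and a synthetic cost model (a real tuner would measure kernel run times on the device instead):

```python
import random

def random_search(configurations, measure, budget):
    """Evaluate `budget` randomly chosen configurations; keep the fastest."""
    sample = random.sample(configurations, budget)
    return min(sample, key=measure)

# Toy space of (thread-coarsening, unroll) pairs and a synthetic cost
# model; both are assumptions for illustration. The true optimum is (8, 1).
space = [(c, u) for c in range(1, 17) for u in (0, 1)]
cost = lambda cfg: abs(cfg[0] - 8) + (0 if cfg[1] else 2)

# Explore a fraction of the space, analogous to the 1/32th on the slide.
best = random_search(space, cost, budget=len(space) // 4)
```

The result varies from run to run, which is exactly the weakness the slide points out.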

Search strategies

Option 2: Simulated annealing
☺ Explores an arbitrary fraction
☓ Performance varies
☓ Meta-parameter (temperature T)
☓ Local optima

[Figure: search progress for SA with T=4 on the K40m; colours show 3 example runs, ending at 64%, 84%, and 91%.]

Example: 107 out of 3424 configurations (1/32th)
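The acceptance rule is the heart of simulated annealing: worse configurations are accepted with probability exp(-delta/T), which lets the search climb out of local optima. A sketch on the same toy space and synthetic cost model as before (both are illustrative assumptions):

```python
import math
import random

def neighbour(cfg, space):
    """Move to a random configuration adjacent in the parameter grid."""
    candidates = [c for c in space if c != cfg
                  and abs(c[0] - cfg[0]) <= 1 and abs(c[1] - cfg[1]) <= 1]
    return random.choice(candidates)

def simulated_annealing(space, measure, budget, temperature):
    """Accept worse moves with probability exp(-delta/T); T is the
    meta-parameter from the slide."""
    current = random.choice(space)
    best = current
    for _ in range(budget):
        nxt = neighbour(current, space)
        delta = measure(nxt) - measure(current)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = nxt                      # accept the move
        if measure(current) < measure(best):
            best = current                     # track the best seen
    return best

space = [(c, u) for c in range(1, 17) for u in (0, 1)]
cost = lambda cfg: abs(cfg[0] - 8) + (0 if cfg[1] else 2)
best = simulated_annealing(space, cost, budget=107, temperature=4.0)
```

A fixed-temperature variant is shown for brevity; real SA usually cools T over the run, and choosing T (or the cooling schedule) is the meta-parameter problem the slide flags.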

Search strategies

Option 3: Particle swarm optimisation
☺ Explores an arbitrary fraction
☓ Performance varies
☓ Meta-parameter (swarm size S)
☓ Local optima

[Figure: search progress for PSO with S=3 on the K40m; colours show 3 example runs, ending at 43%, 44%, and 66%; line-types show the 3 swarms.]

Example: 107 out of 3424 configurations (1/32th)
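For intuition, here is a minimal 1-D continuous PSO on a synthetic cost function; real tuning parameters are discrete, and the coefficients (inertia 0.7, attraction 1.5) are conventional defaults chosen here for illustration:

```python
import random

def pso(measure, bounds, swarm_size, steps):
    """Minimal 1-D particle swarm: particles move toward their own and
    the swarm's best-seen positions (S on the slide is the swarm size)."""
    lo, hi = bounds
    pos = [random.uniform(lo, hi) for _ in range(swarm_size)]
    vel = [0.0] * swarm_size
    pbest = pos[:]                         # per-particle best position
    gbest = min(pos, key=measure)          # swarm-wide best position
    for _ in range(steps):
        for i in range(swarm_size):
            r1, r2 = random.random(), random.random()
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)  # clamp to bounds
            if measure(pos[i]) < measure(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest + [gbest], key=measure)
    return gbest

# Synthetic cost with the optimum at 8 (stand-in for kernel run time):
cost = lambda x: (x - 8.0) ** 2
best = pso(cost, bounds=(1, 16), swarm_size=6, steps=100)
```

The whole swarm is pulled toward gbest, which is why PSO tends to collapse onto one region; on the rugged, discrete search spaces of GPU kernels this is a poor fit, consistent with the results on the next slide.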

Search strategies evaluation

[Figure: histograms of performance (% of best-known) over the search space for random search, SA (T=2, 4, 6), and PSO (S=3, 6) on the K40m.]

Each search: 107 out of 3424 configurations (1/32th); reported is the average best result of 128 searches; meta-parameters are varied for SA and PSO.

Conclusions:
• PSO performs poorly
• SA performs well, but not much better than random search

[1]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC '15.

Search strategies evaluation

[Figure: the same evaluation repeated on four devices (K40m, GTX480, HD7970, Iris): histograms of performance (% of best-known) over the search space for random search, SA (T=2, 4, 6), and PSO (S=3, 6).]

Solution: machine learning?


Machine learning an auto-tuner

• Evaluate a subset of all configurations (e.g. 100 out of 3424)
• Train a model with the examples (input: parameter configuration; output: execution time)
  • Linear regression
  • Neural network (3-layer, fully connected)
• Predict the execution time for all other configurations
• Take the best 10 predictions and evaluate them on actual hardware

[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.
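The train/predict/evaluate loop can be sketched end-to-end. Everything below is illustrative: the parameter space and the synthetic "measured" times stand in for real device measurements, and the hand-rolled gradient-descent linear model stands in for the linear-regression and neural-network models of [2]:

```python
import random

def fit_linear(X, y, lr=0.001, epochs=3000):
    """Plain stochastic-gradient-descent linear regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(len(w)):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return lambda x: b + sum(wj * xj for wj, xj in zip(w, x))

# Toy space of (thread-coarsening, unroll) configurations with a hidden
# "true" execution time; a real tuner would measure these on the device.
space = [(c, u) for c in range(1, 17) for u in (0, 1)]
true_time = lambda cfg: 5.0 + 0.2 * cfg[0] - 1.0 * cfg[1]

# 1) Measure a small random subset (cf. 100 out of 3424 on the slide).
subset = random.sample(space, 10)
model = fit_linear([list(c) for c in subset],
                   [true_time(c) for c in subset])

# 2) Predict all remaining configurations, 3) take the best 10
# predictions, and 4) re-measure those on "actual hardware".
rest = [c for c in space if c not in subset]
top10 = sorted(rest, key=lambda c: model(list(c)))[:10]
best = min(subset + top10, key=true_time)
```

Only subset + top10 configurations are ever "run", yet the model steers the final measurements toward the promising corner of the space.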

Machine learning an auto-tuner

[Figure: model predictions for the convolution example, trained on a random subset (1/32th); full view and zoomed in.]

GEMM case-study

[Figure: GEMM performance comparison (% of device peak) of this work vs. clBLAS vs. cuBLAS on K40m, GTX480, HD7970, and Iris; bars are annotated with absolute GFLOPS numbers, and cuBLAS is N/A on the non-NVIDIA devices.]

[Figure: tiling scheme for C += A x B: per-workgroup tiles of Mwg by Nwg (stepping by Kwg along K), per-thread tiles of Mwi by Nwi, with the inner loop unrolled by a factor Kwi; constraint: Mwi x MdimC = Mwg.]

Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS


Convolution case-study

[Figure: performance (% of device peak) on the GTX480 for 3x3, 7x7, and 11x11 filters; bars annotated with 298 GFLOPS / 125 GB/s, 572 GFLOPS / 46 GB/s, and 787 GFLOPS / 26 GB/s respectively.]

Conclusions:
• Different best parameters for different devices (see paper) and filter-sizes
• Performance equal to or better than the state-of-the-art [3]

[3]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.

What about CUDA support?

Written for OpenCL, but…
• Uses the high-level OpenCL API CLCudaAPI
• And CLCudaAPI also has a CUDA back-end!
• Switch between OpenCL and CUDA with a single include
• Uses the CUDA 7.0 driver API and the runtime compilation library (nvrtc)

Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning

Auto-tuning GPU kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent

Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation

Case-studies:
• Fastest 2D convolution
• CLBlast matrix-multiplication library

Machine-learning:
• Train a model on a small subset
• Use the model to predict the remainder

Source-code on GitHub: https://github.com/CNugteren/CLTune

[More info]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.

Slides available @ www.cedricnugteren.nl