
Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning

Cedric Nugteren

GPU Technology Conference, April 7, 2016

HPC center of the Netherlands / Augmented Reality Technology

How to find the best flight ticket?

• Which airline?
• Fly a day earlier or later?
• Which route?

Solution:
• Evaluate all combinations manually
• … or use a flight comparison website

How to find the best flight ticket?

Solution:
• Evaluate all combinations manually
• … or use a flight comparison website: the CLTune auto-tuner

For a GPU kernel, the questions are analogous:
• Cache in shared memory?
• Thread-block sizes?
• More work per thread or less?

Example: convolution

Targets:
• GPUs (CUDA & OpenCL)
• Multi-core CPUs
• Other OpenCL-capable devices

Example: blur filter

[Figure: a 2D convolution reads input A and writes output B; the filter has size Xf by Yf and a set of filter coefficients. Example: a 3 by 3 filter.]

OpenCL 2D convolution:
• A double for-loop over the filter coefficients
• Each thread computes one output pixel

Search-space explosion

• Thread coarsening (2D)?
• Unroll loops?
• Work-group / thread-block size?
• Vector data-types?
• Cache in local / shared memory?

16 x 2 x 16 x 4 x 5 = 10240 combinations; after filtering out illegal configurations, 3424 remain.

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
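The arithmetic is easy to reproduce. A minimal Python sketch, with hypothetical parameter names and a hypothetical work-group limit (the slide only gives the value counts), showing both the raw product and the kind of constraint that filters out illegal configurations:

```python
import itertools
from functools import reduce

# Number of values per tunable parameter as counted on the slide
# (16 x 2 x 16 x 4 x 5 = 10240); which count belongs to which parameter
# is an assumption here.
value_counts = [16, 2, 16, 4, 5]
print(reduce(lambda a, b: a * b, value_counts))  # -> 10240

# Illegal configurations must be filtered out before tuning. A toy example
# with hypothetical work-group parameters and a 1024-thread device limit:
params = {"wg_x": [8, 16, 32, 64], "wg_y": [8, 16, 32, 64]}
legal = [
    (x, y)
    for x, y in itertools.product(params["wg_x"], params["wg_y"])
    if x * y <= 1024  # device limit on threads per work-group
]
print(len(legal))  # 16 raw combinations, 3 of which are illegal -> 13
```

For the real convolution kernel the same kind of filtering prunes 10240 raw combinations down to 3424 legal ones.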

Why do we need an auto-tuner?

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Wide variety of devices:
• Different optimal kernels
• Even from the same vendor

User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.


CLTune in action

• Takes example input data
• Verifies output based on an optional reference kernel
• Handles all the host-device data transfers

Example: matrix-vector multiplication (y = Ax) with tiling; the kernel to tune has a tuneable tile-size TS with 4 possible values.

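CLTune's host code is C++; the control flow it automates for this example can be sketched in Python. Everything below (the toy tiled matvec, the tile-size values, the timing loop) is illustrative and is not CLTune's API:

```python
import time

def matvec_reference(A, x):
    """Reference result used to verify each tuned variant."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_tiled(A, x, ts):
    """Toy stand-in for the tiled y = Ax kernel; ts is the tile size TS."""
    n = len(x)
    y = [0.0] * len(A)
    for i, row in enumerate(A):
        for t in range(0, n, ts):  # loop over tiles of size TS
            y[i] += sum(row[j] * x[j] for j in range(t, min(t + ts, n)))
    return y

def tune(A, x, tile_sizes):
    """Run every configuration, verify it, and keep the fastest."""
    reference = matvec_reference(A, x)
    best = None
    for ts in tile_sizes:
        start = time.perf_counter()
        y = matvec_tiled(A, x, ts)
        elapsed = time.perf_counter() - start
        # Verification against the reference kernel, as CLTune does:
        assert all(abs(a - b) < 1e-6 for a, b in zip(y, reference))
        if best is None or elapsed < best[1]:
            best = (ts, elapsed)
    return best

A = [[float(i + j) for j in range(64)] for i in range(64)]
x = [1.0] * 64
ts, _ = tune(A, x, tile_sizes=[4, 8, 16, 32])  # the 4 values of TS
```

The real tuner additionally compiles one kernel binary per configuration and moves the buffers to and from the device before timing.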

Search strategies

Option 0: Full search
☺ Finds the optimal solution
☓ Explores all options

[Figure: rotated histograms of performance (% of best-known) over the full search space, with means marked, for random search, SA (T=2, 4, 6), and PSO (S=3, 6). 3424 convolution kernels on a Tesla K40m GPU.]

Search strategies

Option 1: Random search
☺ Explores an arbitrary fraction
☓ Performance varies

[Figure: search progress (steps) vs. performance (% of best-known) for random search on the K40m; colours show 3 example runs, ending at 56%, 74%, and 94%.]

Example: 107 out of 3424 configurations (1/32th)
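Random search is simple to state. A minimal sketch with a toy search space and a synthetic cost model (a real tuner would measure kernel run times on the device instead):

```python
import random

def random_search(configurations, measure, budget):
    """Evaluate `budget` randomly chosen configurations; keep the fastest."""
    sample = random.sample(configurations, budget)
    return min(sample, key=measure)

# Toy space of (thread-coarsening, unroll) pairs and a synthetic cost
# model; both are assumptions for illustration. The true optimum is (8, 1).
space = [(c, u) for c in range(1, 17) for u in (0, 1)]
cost = lambda cfg: abs(cfg[0] - 8) + (0 if cfg[1] else 2)

# Explore a fraction of the space, analogous to the 1/32th on the slide.
best = random_search(space, cost, budget=len(space) // 4)
```

The result varies from run to run, which is exactly the weakness the slide points out.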

Search strategies

Option 2: Simulated annealing
☺ Explores an arbitrary fraction
☓ Performance varies
☓ Meta-parameter (temperature T)
☓ Local optima

[Figure: search progress for SA with T=4 on the K40m; colours show 3 example runs, ending at 64%, 84%, and 91%.]

Example: 107 out of 3424 configurations (1/32th)
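The acceptance rule is the heart of simulated annealing: worse configurations are accepted with probability exp(-delta/T), which lets the search climb out of local optima. A sketch on the same toy space and synthetic cost model as before (both are illustrative assumptions):

```python
import math
import random

def neighbour(cfg, space):
    """Move to a random configuration adjacent in the parameter grid."""
    candidates = [c for c in space if c != cfg
                  and abs(c[0] - cfg[0]) <= 1 and abs(c[1] - cfg[1]) <= 1]
    return random.choice(candidates)

def simulated_annealing(space, measure, budget, temperature):
    """Accept worse moves with probability exp(-delta/T); T is the
    meta-parameter from the slide."""
    current = random.choice(space)
    best = current
    for _ in range(budget):
        nxt = neighbour(current, space)
        delta = measure(nxt) - measure(current)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = nxt                      # accept the move
        if measure(current) < measure(best):
            best = current                     # track the best seen
    return best

space = [(c, u) for c in range(1, 17) for u in (0, 1)]
cost = lambda cfg: abs(cfg[0] - 8) + (0 if cfg[1] else 2)
best = simulated_annealing(space, cost, budget=107, temperature=4.0)
```

A fixed-temperature variant is shown for brevity; real SA usually cools T over the run, and choosing T (or the cooling schedule) is the meta-parameter problem the slide flags.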

Search strategies

Option 3: Particle swarm optimisation
☺ Explores an arbitrary fraction
☓ Performance varies
☓ Meta-parameter (swarm size S)
☓ Local optima

[Figure: search progress for PSO with S=3 on the K40m; colours show 3 example runs, ending at 43%, 44%, and 66%; line-types show the 3 swarms.]

Example: 107 out of 3424 configurations (1/32th)
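For intuition, here is a minimal 1-D continuous PSO on a synthetic cost function; real tuning parameters are discrete, and the coefficients (inertia 0.7, attraction 1.5) are conventional defaults chosen here for illustration:

```python
import random

def pso(measure, bounds, swarm_size, steps):
    """Minimal 1-D particle swarm: particles move toward their own and
    the swarm's best-seen positions (S on the slide is the swarm size)."""
    lo, hi = bounds
    pos = [random.uniform(lo, hi) for _ in range(swarm_size)]
    vel = [0.0] * swarm_size
    pbest = pos[:]                         # per-particle best position
    gbest = min(pos, key=measure)          # swarm-wide best position
    for _ in range(steps):
        for i in range(swarm_size):
            r1, r2 = random.random(), random.random()
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)  # clamp to bounds
            if measure(pos[i]) < measure(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest + [gbest], key=measure)
    return gbest

# Synthetic cost with the optimum at 8 (stand-in for kernel run time):
cost = lambda x: (x - 8.0) ** 2
best = pso(cost, bounds=(1, 16), swarm_size=6, steps=100)
```

The whole swarm is pulled toward gbest, which is why PSO tends to collapse onto one region; on the rugged, discrete search spaces of GPU kernels this is a poor fit, consistent with the results on the next slide.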

Search strategies evaluation

[Figure: histograms of performance (% of best-known) over the search space for random search, SA (T=2, 4, 6), and PSO (S=3, 6) on the K40m.]

Each search: 107 out of 3424 configurations (1/32th); reported is the average best result of 128 searches; meta-parameters are varied for SA and PSO.

Conclusions:
• PSO performs poorly
• SA performs well, but not much better than random search

[1]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC '15.

Search strategies evaluation

[Figure: the same evaluation repeated on four devices (K40m, GTX480, HD7970, Iris): histograms of performance (% of best-known) over the search space for random search, SA (T=2, 4, 6), and PSO (S=3, 6).]

Solution: machine learning?


Machine learning an auto-tuner

• Evaluate a subset of all configurations (e.g. 100 out of 3424)
• Train a model with the examples (input: parameter configuration; output: execution time)
  • Linear regression
  • Neural network (3-layer, fully connected)
• Predict the execution time for all other configurations
• Take the best 10 predictions and evaluate them on actual hardware

[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.
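The train/predict/evaluate loop can be sketched end-to-end. Everything below is illustrative: the parameter space and the synthetic "measured" times stand in for real device measurements, and the hand-rolled gradient-descent linear model stands in for the linear-regression and neural-network models of [2]:

```python
import random

def fit_linear(X, y, lr=0.001, epochs=3000):
    """Plain stochastic-gradient-descent linear regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(len(w)):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return lambda x: b + sum(wj * xj for wj, xj in zip(w, x))

# Toy space of (thread-coarsening, unroll) configurations with a hidden
# "true" execution time; a real tuner would measure these on the device.
space = [(c, u) for c in range(1, 17) for u in (0, 1)]
true_time = lambda cfg: 5.0 + 0.2 * cfg[0] - 1.0 * cfg[1]

# 1) Measure a small random subset (cf. 100 out of 3424 on the slide).
subset = random.sample(space, 10)
model = fit_linear([list(c) for c in subset],
                   [true_time(c) for c in subset])

# 2) Predict all remaining configurations, 3) take the best 10
# predictions, and 4) re-measure those on "actual hardware".
rest = [c for c in space if c not in subset]
top10 = sorted(rest, key=lambda c: model(list(c)))[:10]
best = min(subset + top10, key=true_time)
```

Only subset + top10 configurations are ever "run", yet the model steers the final measurements toward the promising corner of the space.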

Machine learning an auto-tuner

[Figure: model predictions for the convolution example, trained on a random subset (1/32th); full view and zoomed in.]

GEMM case-study

[Figure: GEMM performance comparison (% of device peak) of this work vs. clBLAS vs. cuBLAS on K40m, GTX480, HD7970, and Iris; bars are annotated with absolute GFLOPS numbers, and cuBLAS is N/A on the non-NVIDIA devices.]

[Figure: tiling scheme for C += A x B: per-workgroup tiles of Mwg by Nwg (stepping by Kwg along K), per-thread tiles of Mwi by Nwi, with the inner loop unrolled by a factor Kwi; constraint: Mwi x MdimC = Mwg.]

Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS


Convolution case-study

[Figure: performance (% of device peak) on the GTX480 for 3x3, 7x7, and 11x11 filters; bars annotated with 298 GFLOPS / 125 GB/s, 572 GFLOPS / 46 GB/s, and 787 GFLOPS / 26 GB/s respectively.]

Conclusions:
• Different best parameters for different devices (see paper) and filter-sizes
• Performance equal to or better than the state-of-the-art [3]

[3]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.

What about CUDA support?

Written for OpenCL, but…
• Uses the high-level OpenCL API CLCudaAPI
• And CLCudaAPI also has a CUDA back-end!
• Switch between OpenCL and CUDA with a single include
• Uses the CUDA 7.0 driver API and the runtime compilation library (nvrtc)

Better Than All the Rest: Finding Maximum-Performance GPU Kernels Using Auto-Tuning

Auto-tuning GPU kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent

Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation

Case-studies:
• Fastest 2D convolution
• CLBlast matrix-multiplication library

Machine-learning:
• Train a model on a small subset
• Use the model to predict the remainder

Source-code on GitHub: https://github.com/CNugteren/CLTune

[More info]: C. Nugteren and V. Codreanu. CLTune: A Generic Auto-Tuner for OpenCL Kernels. IEEE MCSoC ’15.

Slides available @ www.cedricnugteren.nl