An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for...

61
An Introduction to OpenCL for Scientific Computing Florian Rudolf 1 , Karl Rupp 1,2 , Josef Weinbub 1 http://karlrupp.net/ 1 Institute for Microelectronics 2 Institute for Analysis and Scientific Computing Technische Universit¨ at Wien, Austria Workshop Programming of Heterogeneous Systems in Physics July 14th, 2014

Transcript of An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for...

Page 1: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

An Introduction to OpenCLfor Scientific Computing

Florian Rudolf1, Karl Rupp1,2, Josef Weinbub1

http://karlrupp.net/

1Institute for Microelectronics2Institute for Analysis and Scientific Computing

Technische Universitat Wien, Austria

Workshop Programming of Heterogeneous Systems in Physics

July 14th, 2014

Page 2: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

2

IntroductionIntroduction

Positions

PhD student at TU Wien (2009-2011)

Postdoc at Argonne Natl. Lab. (09/2012-09/2013)

Postdoc at TU Wien (09/2013-current)

Research Interests

Semiconductor device simulation

Numerical solution of PDEs

Parallel computing

Software Development

PETSc

ViennaCL

ViennaSHE

...

Page 3: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

3

100x Speed-Up?100x Speed-Up?

Proc. ISCA 2010

Page 4: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

4

100x Speed-Up?100x Speed-Up?

102

103

104

2007 2008 2009 2010 2011 2012 2013

GF

LO

P/s

ec

End of Year

Theoretical Peak Performance, Double Precision

CPUs, Intel

Xeon X5482Xeon X5492

Xeon W5590Xeon X5680

Xeon X5690

Xeon E5-2690 Xeon E5-2697 v2

GPUs, NVIDIA

GPUs, AMD

MIC, Intel

Tesla C1060

Tesla C1060

Tesla C2050 Tesla C2090Radeon HD 7970 GHz Ed.

Radeon HD 8970

Radeon HD 3870

Radeon HD 4870

Radeon HD 5870Radeon HD 6970

Radeon HD 6970Tesla K20

Tesla K20X

Xeon Phi X7120X

Page 5: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

5

100x Speed-Up?100x Speed-Up?

101

102

103

2007 2008 2009 2010 2011 2012 2013

GB

/se

c

End of Year

Peak Memory Bandwidth Comparison

CPUs, Intel

Xeon X5482 Xeon X5492

Xeon W5590 Xeon X5680 Xeon X5690

Xeon E5-2690

Xeon E5-2697 v2

GPUs, NVIDIA

GPUs, AMD

MIC, Intel

Radeon HD 3870

GeForce G

TX 280

GeForce G

TX 285

GeForce G

TX 580

GeForce G

TX 580

Radeon HD 7970 G

Hz Ed.

GeForce G

TX Tita

n

GeForce 8800 G

TSRadeon H

D 4870

Radeon HD 5870

Radeon HD 6970

Radeon HD 6970 GeForce

GTX 680

Xeon Phi 7

120XRadeon H

D 8970

Page 6: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

6

100x Speed-Up?100x Speed-Up?

1

10

2007 2008 2009 2010 2011 2012 2013

FL

OP

pe

r B

yte

End of Year

Floating Point Operations per Byte, Double Precision

Radeon HD 3870

Radeon HD 4870

Radeon HD 5870

Radeon HD 6970

Radeon HD 6970

Radeon HD 7970 GHz Ed.

Radeon HD 8970

Tesla C1060

Tesla C1060

Xeon X5680

Xeon X5690

Tesla K20

Tesla K20X

CPUs, Intel

Xeon X5482

Xeon X5492

Xeon W5590

Tesla C2050

Tesla C2090Xeon E5-2690

Xeon E5-2697 v2

GPUs, NVIDIA

GPUs, AMD

MIC, Intel

Xeon Phi X7120X

Page 7: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

7

GPUs: DisillusionGPUs: Disillusion

Computing Architecture Schematic

Memory

PCI Express

CPU GPU

Memory

Page 8: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

8

GPUs: DisillusionGPUs: Disillusion

Computing Architecture Schematic

PCI Express

8x20 GB/s2x12 GB/s

CPU GPU

8 GB/s, ~1us Latency

1000 GFLOPs SP 250 GFLOPs DP

100 GFLOPs SP 50 GFLOPs DP

Good for large FLOP-intensive tasks, high memory bandwidth

PCI-Express can be a bottleneck

� 10-fold speedups (usually) not backed by hardware

Page 9: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

9

Table of ContentsTable of Contents

100x Speedup!?

About OpenCL

OpenCL Execution Model

OpenCL Kernel Language

Comparison with CUDA and OpenACC

Portable Performance

Summary

Page 10: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

10

History of OpenCLHistory of OpenCL

Prior to 2008

OpenCL developed by Apple Inc.

2008

OpenCL working group formed at Khronos Group

OpenCL specification 1.0 released

2010

OpenCL 1.1 (multi-device, subbuffer manipulation)

2011

OpenCL 1.2 (device partitioning)

2013

OpenCL 2.0 (shared virtual memory, SPIR, etc.)

Page 11: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

11

About OpenCLAbout OpenCL

Advantages

Not restricted to a single vendor: Intel, NVIDIA, AMD, ...

Just a shared C-library

Does not rely on compiler magic

Disadvantages

Not restricted to a single vendor

Boilerplate code required

Portable performance?

Page 12: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

11

About OpenCLAbout OpenCL

Advantages

Not restricted to a single vendor: Intel, NVIDIA, AMD, ...

Just a shared C-library

Does not rely on compiler magic

Disadvantages

Not restricted to a single vendor

Boilerplate code required

Portable performance?

Page 13: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

12

Table of ContentsTable of Contents

100x Speedup!?

About OpenCL

OpenCL Execution Model

OpenCL Kernel Language

Comparison with CUDA and OpenACC

Portable Performance

Summary

Page 14: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

13

OpenCL Platform ModelOpenCL Platform Model

Platform 1

Context 1

Kernel 2,1

Kernel 2,2

Kernel 1,n

Kernel 1,2

Kernel 1,1

Kernel 2,m

Memory Buffer 1

Memory Buffer 2

Device 1 Device 2

Device List

Memory Buffer k

Command Queue 2

Command Queue 1Program 2

Device 3

Program 1

Page 15: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

14

OpenCL Platform ModelOpenCL Platform Model

https://github.com/karlrupp/phsp2014

Page 16: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

15

OpenCL Host APIOpenCL Host API

// Setup contet and queuecontext = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);queue = clCreateCommandQueue(context, NULL, 0, NULL);

// Create memory buffersmemobjs[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries

, NULL, NULL);memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,

sizeof(float)*2*num_entries, srcA, NULL);

// Create OpenCL program and kernelsprogram = clCreateProgramWithSource(context, 1, &kernel_src, NULL, NULL);clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

kernel = clCreateKernel(program, "my_kernel", NULL);clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// Run an OpenCL kernel:global_work_size[0] = 128;local_work_size[0] = 64;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size,0, NULL, NULL);

Issues

“Where is the error?”

Manage OpenCL handles

Page 17: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

16

Table of ContentsTable of Contents

100x Speedup!?

About OpenCL

OpenCL Execution Model

OpenCL Kernel Language

Comparison with CUDA and OpenACC

Portable Performance

Summary

Page 18: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

17

OpenCL Kernel LanguageOpenCL Kernel Language

Primer: OpenCL Device Execution Model

http://developer.amd.com/documentation/articles/PublishingImages/opencl figure5.jpg

Page 19: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

18

OpenCL Kernel LanguageOpenCL Kernel Language

Sample Operation: Inplace Vector Additionx1

x2

...xn

+=

y1

y2

...yn

OpenCL Kernel

__kernel void inplace_add(__global const float * vec1,__global const float * vec2,unsigned int size)

{for (unsigned int i = get_global_id(0);

i < size;i += get_global_size(0))

x[i] += y[i];}

Page 20: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

19

OpenCL Kernel LanguageOpenCL Kernel Language

Just Like C

Datatypes: float, double, long, ... , float2, float4, ...

Keywords: for, if, switch, ...

...

Thread Management

ID: get_local_id(dim), get_global_id(dim)

Count: get_local_size(dim), get_global_size(dim)

Sync: barrier(flags), mem_fence(flags)

Memory Qualifiers

__global: Global memory (e.g. GPU RAM)

__constant: Constant global memory (vendor-specific!)

__local: Shared memory (per workgroup)

Page 21: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

20

OpenCL Kernel LanguageOpenCL Kernel Language

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

Page 22: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

20

OpenCL Kernel LanguageOpenCL Kernel Language

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Page 23: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

20

OpenCL Kernel LanguageOpenCL Kernel Language

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Parameters (� 1000 variations possible)

Local work size, global work size

Vector types (float1, float2, ... , float16)

Thread increment type

Page 24: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

20

OpenCL Kernel LanguageOpenCL Kernel Language

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Parameters (� 1000 variations possible)

for (size t i = get global id(0); i < N;

i+= get global size(0))

x[i] = y[i];

Page 25: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

21

OpenCL Kernel LanguageOpenCL Kernel Language

No Synchronization: Vector Addition

x

z

y

Synchronization: Dot Product

z

y

a

Page 26: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

22

Table of ContentsTable of Contents

100x Speedup!?

About OpenCL

OpenCL Execution Model

OpenCL Kernel Language

Comparison with CUDA and OpenACC

Portable Performance

Summary

Page 27: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

23

Comparison with CUDA and OpenACCComparison with CUDA and OpenACC

NVIDIA CUDA

// GPU kernel:__global__ void kernel(double *buffer){

int idx = blockIdx.x * blockDim.x + threadIdx.x;buffer[idx] = 42.0;

}

// host code:int main(){

...cudaMalloc((void**)&buffer,size);kernel<<<blocknum, blockdim>>>(buffer);...

}

Almost no additional code required

Vendor-lock

Relies on nvcc being available (plus version conflicts...)

Page 28: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

24

Comparison with CUDA and OpenACCComparison with CUDA and OpenACC

OpenCL

const char *kernel_string ="__kernel void mykernel(__global double *buffer) {

buffer[get_global_id(0)] = 42.0;};";

int main() {...cl_program my_prog = clCreateProgramWithSource(

my_context,1,&kernel_string,&source_len,&err);clBuildProgram(my_prog,0,NULL,NULL,NULL,NULL);cl_kernel my_kernel = clCreateKernel(my_prog,

"mykernel",&err);clSetKernelArg(my_kernel,0,sizeof(cl_mem),&buffer);clEnqueueNDRangeKernel(queue,my_kernel,1,NULL,

&global_size,&local_size,0,NULL,NULL);}

Additional boilerplate code required (low-level API)

Broad hardware support (separate SDKs)

Second-best programming model for each vendor

Page 29: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

25

Comparison with CUDA and OpenACCComparison with CUDA and OpenACC

OpenACC

void func(...) {#pragma acc data pcopyin(A[0:size][0:size]){#pragma acc kernels loopfor(int i=0; i < size; i++)

for(int j=0; j < size; j++)A[i][j] = 42;

}}

int main(){

double A[1337][1337];func(A);

}

Simple OpenMP-type pragma annotations

Compiler support?

Insufficient control over memory transfers?

Page 30: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

26

Comparison with CUDA and OpenACCComparison with CUDA and OpenACC

Thread Explosion Problem

void func(double *A) {#pragma omp parallel forfor(int i=0; i<1337; i++) A[i] = 42.0;

}

void func2(double *A) {#pragma omp parallel forfor(int i=0; i<1337; i++) func(A);

}

int main() {double A[1337];func(A); // okayfunc2(A); // boom...

}

Usually not as obvious as here

Problem when OpenMP’ing MPI code

Headache for library implementors

Page 31: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

27

Table of ContentsTable of Contents

100x Speedup!?

About OpenCL

OpenCL Execution Model

OpenCL Kernel Language

Comparison with CUDA and OpenACC

Portable Performance

Summary

Page 32: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

28

BenchmarksBenchmarks

10-8

10-7

10-6

10-5

10-4

10-3

10-2

10-1

101

102

103

104

105

106

107

Execution T

ime (

sec)

Vector Size

Vector Addition x = y + z

NVIDIA GTX 285, CUDANVIDIA GTX 285, OpenCL

AMD Radeon HD 7970, OpenCLIntel Xeon Phi Beta, OpenCL

Intel Xeon Phi Beta, nativeIntel Xeon X5550, OpenMP

Intel Xeon X5550, single-threaded

Page 33: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

29

BenchmarksBenchmarks

10-5

10-4

10-3

10-2

10-1

100

101

101

102

103

104

105

106

107

Execution T

ime (

sec)

Number of Unknowns

50 CG Iterations (2D FD for Poisson)

NVIDIA GTX 285, CUDANVIDIA GTX 285, OpenCL

AMD Radeon HD 7970, OpenCLIntel Xeon Phi Beta, OpenCL

Intel Xeon Phi Beta, nativeIntel Xeon X5550, OpenMP

Intel Xeon X5550, single-threaded

Page 34: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

30

Portable PerformancePortable Performance

Scope for Portability Study

Vector and matrix-vector operations (BLAS levels 1 and 2)

Limited by memory bandwidth

Key Question (Memory-Bandwidth-Limited Kernels)

Good performance of complicated kernelsby optimizing the simplest kernel?

Page 35: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

31

Portable PerformancePortable Performance

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

Page 36: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

31

Portable PerformancePortable Performance

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Page 37: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

31

Portable PerformancePortable Performance

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Parameters (1900 variations)

Local work size, global work size

Vector types (float1, float2, ... , float16)

Thread increment type

Page 38: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

31

Portable PerformancePortable Performance

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Parameters (1900 variations)

for (size t i = get global id(0); i < N;

i+= get global size(0))

x[i] = y[i];

Page 39: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

31

Portable PerformancePortable Performance

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Parameters (1900 variations)

for (size t i = group start + get local id(0);

i < group end;

i+= get local size(0)) x[i] = y[i];

Page 40: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

31

Portable PerformancePortable Performance

Vector Assignment (Copy) Kernel

x⇐ y for (large) vectors x, y

x

y

Parameters (1900 variations)

for (size t i = group start + get local id(0);

i < group end; i+= get local size(0))

x[i] = y[i];

Page 41: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

32

Portable PerformancePortable Performance

Operations

Vector copy, vector addition, inner product

Matrix-vector product

x

z

y

Devices

AMD: Radeon HD 5850, FirePro W9000

INTEL: Dual Socket Xeon E5-2670, Xeon Phi

NVIDIA: GTX 285, Tesla K20m

Page 42: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

32

Portable PerformancePortable Performance

Operations

Vector copy, vector addition, inner product

Matrix-vector product

z

y

a

Devices

AMD: Radeon HD 5850, FirePro W9000

INTEL: Dual Socket Xeon E5-2670, Xeon Phi

NVIDIA: GTX 285, Tesla K20m

Page 43: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

32

Portable PerformancePortable Performance

Operations

Vector copy, vector addition, inner product

Matrix-vector product

z

y

a

Devices

AMD: Radeon HD 5850, FirePro W9000

INTEL: Dual Socket Xeon E5-2670, Xeon Phi

NVIDIA: GTX 285, Tesla K20m

Page 44: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

33

Portable PerformancePortable Performance

Histograms

Page 45: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

34

Portable PerformancePortable Performance

5 15 25 35 45 55 65 75 85 95

AMD FirePro W9000

Bandwidth Inner Product (% of peak)

010

020

030

040

0 Increment by Local Work SizeIncrement by Global Work Size

Page 46: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

35

Portable PerformancePortable Performance

5 15 25 35 45 55 65 75 85 95

NVIDIA Tesla K20m

Bandwidth Inner Product (% of peak)

010

020

030

040

0 Increment by Local Work SizeIncrement by Global Work Size

Page 47: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

36

Portable PerformancePortable Performance

5 15 25 35 45 55 65 75 85 95

2x INTEL Xeon E5−2670

Bandwidth Inner Product (% of peak)

010

020

030

040

0 Increment by Local Work SizeIncrement by Global Work Size

Page 48: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

37

Portable PerformancePortable Performance

5 15 25 35 45 55 65 75 85 95

2x INTEL Xeon E5−2670

Bandwidth Inner Product (% of peak)

010

020

030

040

0 Increment by Local Work SizeIncrement by Global Work Size

x

y

Page 49: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

38

Portable PerformancePortable Performance

[Addition|Inner Product|Matrix-Vector] vs. Copy Kernel

Same Device

Page 50: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

39

Portable PerformancePortable Performance

Page 51: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

39

Portable PerformancePortable Performance

Page 52: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

39

Portable PerformancePortable Performance

Page 53: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

40

Portable PerformancePortable Performance

NVIDIA Tesla K20m

Addition Inner Product Mat-Vec Product

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●

●●●●●

●●●●●

●●

●●

●●●

●●

●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●●

●●

●●●●

●●●●

●●

●● ●

●● ●●●

●●●

●● ●●●

●●●●●

●●

●●●●●●

●●●●●

●●

●●

●●

●●●●●

●●●●●●●

●●

●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●●●

●●●●●●●

●●●●●

●●

●●

●●●

●●●

●●●●

●●●●

●●

●●

●●

●●●●●

●●●

●●

●●

●●

●●●●

●●●

●●

●●●

●● ●●

●●

●●

●●●●●●

● ●

●●●●●

●●

● ●●●●●●●●●●●●●

●●●●●●

●●

●●●●●●●

●●●●●●●

●●●●

●●●

●●●●●

●●●●●

●●

●●

●●●

●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●

●● ●

●●

● ●●●●

●●●

●● ●●●

●●●●●

●●

●●●●●●

●●●●●●

●●

● ● ●●●●●

●●●●●

●●

●●●●●●

●●●●●●●●

●●

●●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●●●●

●●●●●●●

●●●●●●

●●●●

●●

●●

●●

●●

●●●●●

●●●●

●●

●●

●●

●●●●

●●●

●●

● ●●

●●

● ●●

●●

●●

●●● ●● ●

● ●

● ●●

●●

● ●●●●

●●●●●

●●●

● ●●

●●●

●●

●●●●●●●

● ●●●

●●●●●●●●●●●

●●●●●●●●

●●●●

●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●

●●

●●●

●●●

●●

●●●

●●

●●●●

●●●

● ●●●●●

●● ●●

●●

●●

●●●●●●●

●●●●

●●

●●●

●●

●●

●●●●●●

●●

●●

●●●●●●●●

●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●●● ●

●●●●

●●

●●

●●●

●●

●●●

●● ●●

●●

●●

●●●

●●●

●●

●●●●●●●

●●

●●

●●

● ●●

●●●

●●

●●●

●●●

● ●●

●●●

●●

●●●●●●●

●●●

●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●

●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●●●

●●●●

●●

●●●●

●● ●

●●●●

●●●

●●●●●●

●● ●●

●●●

●●

●●●●●●

●●

●●●●

●●

●●●

●●

●●●●●●●● ●

●●

●●●●

●●●●●●●●

●●

●●

●●●

●●●●●●●● ●●

●●●●●●●

●●●●●●●●

●●●●

●●●●●

●●●●

●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●● ●

●●●●

●●●

●●

●●●●

●●●

●●●●●●

●●

●●●●●

●●

●●●●●●●

●●●●●●●

●●

●●●

●●●

●●●●●●●●

●●

●●

●●●●●●●●●●●●

●●

●●●

●● ●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●●●●●●

●●●●

●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●

●●●●●●

●●●●●●●

●●●●

●●●●●●●

●●●●●●●

●●●

●●

●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

NVIDIA Tesla K20m

Bandwidth Copy (% of peak)

Ban

dwid

th A

dditi

on (

% o

f pea

k)

Copy vs. Addition

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●

●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●●

●●

●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●● ●

●●

●●●●

●●●

●●●●

●●

●●● ●

●●●

●●●●

●●

●●●●

●● ●

●●

●●●●●

●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●●

●●●●

●●●

●●●●

●●●

●●

●●

● ●

●● ●●

●●

●●●

●●●●●●

●●●●●●

●●

●●●●●

●●

●●

●●●●●

●●

●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●●

●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●● ●

●●

●●●●

●●●

●●

●●●

●●

●●● ●

●●●

●●●●●

●●●●●●

●●

●●

●●●●●

●●●●

●● ●

●●●●●

●●●●●●●

●●

●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●●

●●●●

●●

●●

●●

●●●●

●●●

●●

● ●●

●● ●●●

●●●

●●●

●●●

●●

●●●●

●●

●●

●●●●●●●●●●●●

●●●●●

●●●●●●●

●●

●●

●●●●●●●●●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●●●

●●●●

●●●

●●

●●●●

●●●

●●

●●●

●●●

●●●

●●

●●●●●

●● ●●●●

●●

●●

●●●●

●●●

●●●

●●

●●

●●

●●●●●●●●

●●

●●

●●●●●

●●●●●

●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●●●

●●●●

●●●

●●●● ●●

●●●

●●

●●●

● ●●

●●●

●●

●●●

●●●

●● ●●

●●●

●●

●●

●●●●

●● ●●●●

●●

●●●

●●

●●●●●●●●

●●

●●●●●●●●●●

●●

●●●●●●●●

●●●●●●●●●●●

●●●●●

●●●

●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●

●● ● ●●

●●

●●●●

●●

●●●

● ● ●●●●

●●●

●●●●●●●● ●●

●● ●

●●

●●●●●●●

●●●●●●

●●

● ●

●●●

●●

●●●●●●

●●●

●●

●●

●●

●●●●●●●●●●●

●●

●●●●●●●●●●●●●● ●●

●●●●●●●

●●●●●●●●●●●

●●●●●

●●

●●

●●●

●●●●●●●

●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●●

● ● ●● ●●

●●●●

●●

●●●

●● ●●●●●

●●●

●●●●●●

●● ●●●●●

●●

●●●●●●●●●●●●

●●

●●

● ●

●●●

●●●

●●●●●●●●

●●

●●

●●

●●●●●●●●●●●●

●●

●●●●● ●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

NVIDIA Tesla K20m

Bandwidth Copy (% of peak)

Ban

dwid

th In

ner

Pro

duct

(%

of p

eak)

Copy vs. Inner Product

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●● ●●●●●●●●●●

●●●●

●●●●● ● ● ● ● ●●●

● ● ●●●●

●●●● ●●

●●

●●

●●

●● ●

●●●

●●● ●●

●●

●●●

●●

●●●

●● ●●

●●●●

●●

●●

● ●

● ●●

●●●

●●●●●●●

●●

● ●●

●●●●●●●●●●● ●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●● ● ● ● ● ●●●

● ● ● ●●●●

●●●● ●●

●●

●●

●●

●●

●●●●

●●● ●●

●●●

●●

●●●●

●● ●●

●●●●●

●●

●●●●

● ●●

●●●

●●

●●

●●●●●

● ●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●● ● ● ● ● ●●●

●●●●●●

●●●● ●●

●●

●●

●●

●● ●

●●●

●●● ●●

●●

●●●

●● ●

●●●

●● ●●

●●

●●

●●●

●●

● ●

● ●●

●●●●●

●●●●●● ●

● ●●

●●●

●●●●●●●

●●

● ●●

●●●●●●●●●●● ●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●● ● ● ● ● ●●●

● ● ● ●●●●

●●●● ●●

●●

●●

●●

●●

●●●●

●●● ●●

●●●

●●

●●

●●●●

●● ●●

●●

●● ●

●●

●●●●

● ●●

●●●●●

●●●●●●

● ●●

●●●

●●●●●●●●●

● ●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●● ● ● ●●●●●●●●

●●●●

●●●●●●

●●

●●●

● ● ●●●●

●●●●

●●

●●

●●●

●● ●

●●●

●●●●

●●

●●●

●●●

●●●●

●● ●●

●●●●●●

●● ●●

●●

● ●●

●●

●●●●

●●●●●● ●

● ●●

●●

●●

●●●● ●●●●●●

● ●●

●●

●●

● ●●●●●●●●●● ●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●● ● ● ●●●●●●●●●●●●

●●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●●● ●●

●●●●

●●

●●●

●● ●

●●●●

●● ●●

●●●●●

●● ●●

●●●

● ●●

●●

●●●●

●● ●●●●●

● ●●

●●

●●

●● ●●●●●

●●

● ●●

●●

●●

● ●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●● ● ● ●●

●● ●●

●●●●

●●●● ●●

● ●● ●●

●● ● ●

●●●●

●●●●

●● ● ●●●

●● ● ●

●●●●

●●●●

●●●●●●●●

● ●●●●

●● ●●

●●

●●●●●●●●●●●●

● ●●

● ●●

●●●●●●●●●●●● ●

● ● ●●●

●●●●●●●●●●●●●●

● ● ●●

●●

●●●●●●●●

●●● ●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●

●●

● ● ●●●

● ●● ●●●●

●●●● ●●

● ●● ●●●● ● ●●

●●●

●●●●

●● ● ●

●●●

●● ●●●●●

●●●●

●●●●●●●●● ●●●●

●● ●●

●●

●●●●●●●●●●●●●

● ●●

● ●●

●●●●●●●●●●●●●

● ● ●●●

●●●●●

●●●●●●●●●

● ● ●●●

●●

● ●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

NVIDIA Tesla K20m

Bandwidth Copy (% of peak)

Ban

dwid

th M

atrix

−V

ecto

r (%

of p

eak)

Copy vs. Matrix−Vector Product

Page 54: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

40

Portable PerformancePortable Performance

AMD Radeon HD 5850

Addition Inner Product Mat-Vec Product

●●●●●●●●●●●●●●●●●●

●●●●●●

●●●●

●●●●●●●●

●●●●

●●●

●●

●●●

●●●●●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●

●●●●●●

●●

●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●

●●●●

●●●●●●●●

●●●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●●●●●●●

●●●●●

● ●

●●●

●●●●●

●●●●●

●●●●●●

●●●●

●●●●●●●●

●●●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●●●●

●●

●●●●

●●

● ●

●●●

●●

●●

●●●●●

●●●●●

●●

●●●●●●

●●●●

●●●●●●●

●●●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

● ● ●●

●●●●

●●

●●

● ●

● ●●●●

●●

●●●●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●●●

●●

●●●●

●●●●●●●●●●●●●●●●●●

●●●●

●●

●●

●●●

●●●●

●●●

●●

●●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●●●●

●●●

●●●

●●

●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●

●●●●

●●

●●●

●●●●●●●●●

●●●

● ●

●●

●●●●

●●

●●●

●●●●

●●

● ●

●●●

●●

●●●●●

●●●●●●●●●

●●●●

●●

●●●●●●●●●●●

●● ●

●●●

●●

●●●●●●●

●●●●

●●●●●●●●●●●●●●

●●●

●●●

●●●

●●●●●●

●●●

●●●●●

●●●●

●●

●●●

●●

●●

●●●●●

●●●●●●●

●●

●●

●●●

●●●●●

●●●

●●●●

●●●●●

●●●●●●

●●●●

●●●

●●●●●

●●●●

●●●●●●●●

●●●●●●

●●●

●●

●●●●●

● ●●●●

●●●

●●

●●●●●●●

●●

●●●●●●●●●●

●●●

●●●●●

●●

●●

●●●●●

●●●●

●●●

●●●●●●●●●●

●●●●●●●

●●●●●

●●●

●●●●

●●●●●●●●●●●●●●

●●●

●●●

●●●●●●●●● ●

●●●

●●

●●●●

●●

● ●

●●●

●●●

●●

●●●

●●

●●●●●

●●

●●●

●●●●●

●●●●●

●●●

●●●●●

●●

●●

●●

●●●●

●●

●●●●●

●●

●●●●

●●●●●●●●●●●●●●

●●●

●●●●●●● ●●●●

●●●

●●

●●

●●●●●●●

●●

●●●

●●●● ●

●●●●

●●●●●●●●●

●●●●●

●●●●

●●●

●●●●

●●

●●●●●

●●●●

●●●●●

●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

AMD Radeon HD 5850

Bandwidth Copy (% of peak)

Ban

dwid

th A

dditi

on (

% o

f pea

k)

Copy vs. Addition

●●●●●●●●●●●●●●●●●●

●●●●●●

●●●●

●●●●

●●●●

●●●●●

●●●● ●

●●●

●●●●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●●●●●

●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●●●●●●●

●●●●●

●●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●●●●

●●●

●●

●●

●●●●●

●●●●●●

●●●●

●●●●●●●●

●●●●●

●●●●

●●● ●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●●

● ●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●●●●●●

●●●

●●●●●●●●

●●●●●

●●●

●●

●●●●

●●●●

●●●●

●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●●●●●●●●●●●●

● ●

●●●●●

●●●●●●●

●●●●●●●●●●

●●●●

●●●●

●●●●

●●●

●●●

●●●●

●●●

●●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●●

●●

●●●●

●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●

●●●

●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●

●● ●

●●●●●●●●●●●●●

●●●●

●●●●●●

●●●●●●

●●●

●●

●●●

●●

●●●

●●●

●●

●●

●● ●

●●

● ●●

●●●●●

●●●●●●●●

● ●

●●

●●●●●

●●

●●●

●●●●● ●●●●

●●●●●●

●●●

●●●●

● ●●●●

●●●●

●●●●

●●●●●●

●●

●●●●●●

●●●

●●●

●●

●●●●

●●●

●●

●●●●●●●●●●

●●●●

● ●

●●

●●●●●

●●

●●●●●

● ●●●●

●●●

●●●●●●●●●●

● ●

●●●●●●●●●●

●●●●

●●●●

●●

●●●

●●●●

●●●

●●●

●●●

●●

●●

●●

●●

●●

● ●

●● ●

● ●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●●

●●●●

●●●●●

●●

● ●●●●

●●●

● ●●●

●●

●●●

●●

●●●●●●●●

●●●●

●●

●●●

●●●●

●●●

●●●●

●●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●●●●●●●●

●●●●

●●●

●●●●

● ●● ●

●●

●●●

●●

● ●●●●

●●●●●

●●

●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

AMD Radeon HD 5850

Bandwidth Copy (% of peak)

Ban

dwid

th In

ner

Pro

duct

(%

of p

eak)

Copy vs. Inner Product

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●● ●●●● ●●●●●

●●●● ● ●● ● ●●● ●● ●●

●●●

●●● ●●

●●

●●

●●

● ●●

●● ●

●● ●●

●●

●●

●●

●●●

● ●●

●●●●●●

●●

●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●● ● ● ● ● ●●

● ● ● ●●●●●

●●● ●●

●●

●●

●●

●● ●● ●

●● ●●

●●

●●

●●●

●●

● ●●

●●

●●●●●

●●

●●●●●

●●●●●

●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●● ●●●● ●●●●●

●●●● ● ●●● ●●

● ●●●● ●●●

●●● ●●

●●

●●●

●● ●

●●

●●

●● ●●

●●

●●

●●

●●●

●●

● ●●

● ●

●●●●●

●●

●●●

●●

●●

● ●

●●

●●●

●●

●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●● ● ● ● ● ●●● ● ●●

●●●

●●● ●●

●●

●●

●●

● ●● ●●

●● ●●

●●

●●

●●

●●●

●●

● ●●

●●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●

●●

●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●● ● ●●●●●●●●●●●●●

●●● ●●

●●●

●●

●●●●●

●● ●

●● ●●

●●

●●

●●

●●●

●●●

● ●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●● ● ●●●●●●●●●●●●

●●● ●●

●●●

●●

●●●●

●● ●●

●●

●●

●●

●●

● ●●

●●●

●●

●●

●●

●●

●●●●●●●●●●●●●

●●

●●●●●●●●●●

●●●

●●

●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●

●●●●●

●●

●●●

●●

●●●●●●

●●●●

●●

●●

●●●●

●●●

●●●

●●

●●●●●

●●

●●●●●●●●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●●●●●

●●

●●●●

● ●●●●

●●●●

●●●●●●●●●●●●

●●●●●●

●●●●●●●

●●●

●●

● ●●●●●

●●●●

●●

●●

●●●

●●

●●●●

●●●●●●●●●●

●●

●●

●●●

●●●●●

●●

●●

● ●●

●●●●

●●

●●●●●●●●●●●

● ●

●●

●●

●●●

●●●●●●

●●●●

●●●●

●●●●●●●●●●●●

●●●●

●●●●●●

●●●●●● ●

●●●

● ●

●●●●

●●●

● ●

● ●●

●●●●●●●

●●

●●

● ●●

●●●●●●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●●●●●●

●●

●●●●

●●●●●●●●●●●●●●

●●●●

●●

●●●●

●●

●●●●●●

●●●

● ●

●●●

●●●

●●

● ●●

●●●

●●●●

●●

●●

● ●

●●

●●●●●●●●●

●●

●●

●●●●

●●●

● ●

●●

●●●●

●●●●

●●

●●●●●●●●

●●●

● ●

0 20 40 60 80 100

020

4060

8010

0

AMD Radeon HD 5850

Bandwidth Copy (% of peak)

Ban

dwid

th M

atrix

−V

ecto

r (%

of p

eak)

Copy vs. Matrix−Vector Product

Page 55: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

40

Portable PerformancePortable Performance

AMD FirePro W9000

Addition Inner Product Mat-Vec Product

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●

●●●

●●●●●●●

●●●●●

●●●●

●●

●●●●●●●●

●●●●●

●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●●●

●●●

● ●

●●

●●●

● ●● ●●●

●●

●●●

●●●●●

●●●●

●●

●●

●●●●●●

●●●●●

●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●

●●●●

●●●

●●●●●●●●●

●●●●●●●

● ●●

●●●●●●●

●●●●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●

●● ●●

●●●

●● ●●

●●●●

●●●

● ●●●●●●●●

●●●

●●

●●●

●●

●●

●●●●●●●

●●●●●●●●●●●

●●●●●

●●●

●●●●●●●

●●●●●

●●●●

●●●

●●●●●●●●●●●●

●●

●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●

●●●

●●

●●●

●●●●●

●●●

●●

●●

●●●●●●

●●

●●

●●●●●●●● ●●●●●

●●●●●●●●●●●●

●●●●

●●●

●●●●●●●●●

●●●●●●●

●●●

●●●●●●●●●●●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●● ●●

●●●

●● ●●●●●●

●●●

●●

●●●●●●●●

●●

●●

●●●● ●

●●●●

●●●●

● ●●

●●

● ●●●●●●

●●

●●●●●●●●●

●●●●●●●●

●●

●●●●●●●●●●●●

●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●●●

●●●●

●●

●●

●●●●●●

● ●●●

● ●

●●

●●●●●●●

●●●●

●●●●

●●●

●●

●●●●●●●●

●●

●●●●●●●●●

●●●●●●

●●●●

●●●●●●●●●●●●●

●●

●●●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●●

●●●●●

●●

●●●

●●●●●●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●●

●●●

●●

●●

●●●●●●●●●●

●●●●●●●

●●●●●

●●●●●●●

●●●●●

●●●

●●●●

●●●●●●●

●●●●●

●●

●●

●●●

●●●

●●

●●

●●●●

●●

●●

●●

●●●

●●●●

●●●

● ●

●●

●●

●●●●●● ●

●●

●●

●●

●●

●● ●●

●●

●●●●

●●

●●●●●●●

●●●

●●

●●●●●●●●●● ●

●●●●●●●

●●●●

●●●●●●●●

●●●●●

●●

●●

●●

●●●●●●●●

●●●●●

●●

●●

●●

●●●●●●●●

●●●●

●●●

●●

●●●

● ●●●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●●●●

●●

● ●

●●●●●●●●●●

●●

●●

●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●

●●

●●

●●●●●●●

●●●

●●●●

●●

●●●●●●●●

●●●

●●●

●●

●●●●

●●

●●●

●●●

●●

●●●

●●●

●●

●●

●●

● ●●●

●●● ●

●●

●●

●●

●●

●●●●●●●

●●

●●

●●

●●●●●●●●

●●

●●

●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●

●●

●●

●●

●●●●●●●●

●●●●

●●

●●●●●●●●

●●●

●●

●●●●

●●●

●●●

●●

●●

●●●

●●

●●

●●●●●

●●●●

●●

●●

● ●

●●●●●

●●●●

●●

● ●

●●

●●●●●●●●●●

●●

●●●●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

AMD FirePro W9000

Bandwidth Copy (% of peak)

Ban

dwid

th A

dditi

on (

% o

f pea

k)

Copy vs. Addition

●●●●●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●

●●●

●●●●●●●●●

●●●●●

●●

●●●

●●●●●●●●●●●●

●●

●●

●●

●●●●●

●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●●

●●●

●●

●●

●●●●●

●●●●

●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●

●●●●●●●

●●●

●●●●●●●●●●●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●●

●●●●

●●

●●●●

●●

●●●

●●

●●●●●●●●

●●

●●

●●●

●● ●●●●

●●●

●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●

●●●●●●●

●●●

●●●●●●●●●●●●

●●

●●

●●

●●●●●

●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●

●●●

●●●

●●

●●

●●●●●

●●

●●

●●

●●●●●●●

●●●●●

●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●●●

●●●●●●●

●●●

●●●●●●●●●●●

●●

●●● ●● ●●●●●

●●●

●● ●● ●● ●●●●●

●●●●●●●●●

●●●●●●●●

●●

●●●●●●●●●●●●

●●●

●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●●●● ● ●●

● ●●●

●●

●●●●●●●●

●●●●●●

●●● ●●●●●●●●

● ●●

●●●●●●●●●●

●●●●●

●●●●

●●●●●●●●●●●●

●●

●●●●●

●●●●●

●●●

●●

●●

●●

●●●●●

●●●●

●●

●●

●●

●●

●● ● ●●●

●●●●

●●

●●

●●

●●

●●●

●●●●●●●●●

●●

●●

●●● ●● ●●●●●

●●

●●

●● ●● ●● ●●

●●●●●●

●●● ●

●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●

●●●●●●●

●●●●●

●●

●●

●●●

●●●● ●●●

●●●●

●●

●●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●●●

●●●

●●

●●

●●

● ●● ●●

●●●●

●● ●

●●●●●● ●●

● ●

●●●

●●●●●●● ●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●

●●●

●●●●●●●●●●

●●●●●

●●

●●

●●

●●●●●●●●

●●●●

●●

● ●●●●●●●

●●●

●●

●●●●●●

●●

●●●

●●●● ●●●

●●

●●●

●●

● ●●●

●●

●●●

●● ●

●●●●●●●

● ●

●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●

● ●●●

●●●●●●●●●●

●●●●

●●

●●

●●

●●●●

●●●

●●●

●●

●●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●● ●

●●

●●

●●

●●●●●●●●

●●

●●

●●

●●●●●●

●●

●●●●●●●●●● ●

●●●●●●●●●

●●●●●●●●●●

●●●●●

● ●●●

●●●●●●●●●●

●●●●

●●

●●

●●

●●●●●●●●

●●●

●●

●●

●●●●●●●●

●●●

●●

●●

●●

●●●●

●●●

●●

●●

●● ●

●●●

●●●

●●

●●

● ●

●●●

●●●

●●●

●●

●●

●●●●●●●

●●●

●●

●●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

AMD FirePro W9000

Bandwidth Copy (% of peak)

Ban

dwid

th In

ner

Pro

duct

(%

of p

eak)

Copy vs. Inner Product

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●● ●

●●

●●●●●●●●●●●● ●●●● ●●●

●●●●●●●● ●●●●● ● ●

● ●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●● ●●

●●

●●●●

● ●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●● ● ●●

●●●●●●●●●●●●●●●● ●●●

●●●●●●●●●●●●● ● ●

●●

●●

●●●● ●●

● ●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●●

●●●●

●●

●●●

●● ●●

●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●● ●

●●

●●●●●●●●●●●● ●●●● ●●●

●●●●●●●● ●●●●●

●●● ●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●●

●● ●●

●● ● ●●●

●●

● ●●

●●● ●●

●●

● ●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●● ●

●●

●●●●●●●●●●●●●●●● ●●●

●●●●●● ●●●●●●●

● ●●

●●

●●●● ●●

● ●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●●

●●●●

●●●

●●

●● ●●

●●●●●●

●●

● ●●

● ●

● ●●●●

●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●● ● ●

● ●●●

●●●●●● ●

● ● ● ●●

●● ●

● ●●●

●●●●●

●●

●●

●●

●●

●● ●●●

●●● ●●

●●

●●●

●●●

●● ●●

●● ●●●

●●●

● ●●

●●●●●●

●●●●

● ●●

●●●

●●●●●● ●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●● ● ●

●●●●

●●●●●●

● ●●

●●

●●

● ●● ●●●

●●●●●

●●

●●

●●

●●●

●●●●

●●●●

●●

●●●●

●●

●● ●●

●●●●●●

● ●●

●●●●●●

●●●●●

● ●●

●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●● ●●●●●●●

●●●●● ● ● ● ● ●●● ●●●● ●●●

●●●● ●●

●●

●●

●● ●●●

● ●●●

●●● ●●

●●●

● ●●●● ●●●

●● ●●

●●

● ●●●●

●●●

●● ●●

●●

●●

●●

●●

●●●

● ●●

●●

●●

●●●●● ●●

● ● ●●

●●●

●●●●●●●● ●●

●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●● ● ● ● ●●

●●●●●●●●

●●●● ●●

●●

●●

●● ●●●●●●●

●●● ●●

●●

● ●

●●●●●●●●

●● ●●

●●

●●●●●

●●●

●● ●●

●●

●●

●●

●●●●●

● ●●

●●

●●

●●●●●●●

● ● ●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●● ● ● ●● ●●●●●●●●●●

●●●● ●●

●●●

●●●●●●● ●●●

●●● ●●

●●●●

● ●●●● ●●●

●● ●●

●●●●

● ●●●●●● ●

●● ●●

●●

●●

●●●

●●● ●

● ●●

●●

●●●●●●●●●

● ●●

●●

●●●●●●●

● ●●

●●

●●●●●●● ●

●●●●●●●●●●●●●●●●●●●

●●●●● ● ● ●● ●●●●●●●●●●

●●●● ●●

●●●

●●●●●●●●●●

●●● ●●

●●●

●●●●●●●●●●

●● ●●

●●

●●●●●●●●●●

●● ●●

●●

●●

●●●●●●●

● ●●

●●

● ●●●●●

●●

● ●●

●●

●●●●●●●●

● ●●

●●

●●●

●●●●●●

0 20 40 60 80 100

020

4060

8010

0

AMD FirePro W9000

Bandwidth Copy (% of peak)

Ban

dwid

th M

atrix

−V

ecto

r (%

of p

eak)

Copy vs. Matrix−Vector Product

Page 56: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

40

Portable PerformancePortable Performance

INTEL Dual Xeon E5-2670

Addition Inner Product Mat-Vec Product

●●●

●● ●●● ●●●●●●●●

●●

●●

●●● ●●● ●●

● ●●●●

●●

●●

●●●

●● ●●●●●● ●●

●● ●

●●

●●●● ●●●●●●●

●●

●●●●

● ●

● ●●●●●●●● ●

●●●

● ●●●●

●● ●●

●●●●● ●

●●●

● ●●

●●

●●● ●●● ●●● ●●

●●

● ●●

● ●●●●● ● ●●●● ● ●

● ●● ●

● ●●● ●●●●●● ● ● ●

●● ●●

●●● ●●● ●●●● ●● ●

●●●●●

●●●●●●

●●●●

●●

●●●●●●●●●●

●●●

●●●

● ●●●●●● ● ●●

●●●

●●

●●●●● ●

● ●●●● ●

●● ●●

●●●

●●●

●●●

●●

●● ●●●● ●

●●

●●

●●●●●●

●●●● ●●

●●●●

●●

●●●●●

●●●

●●●●

●●●●

●●●

●●●

●●

●●●

●● ●●

●● ●● ● ●●

●●●

●●

●●● ●●

●● ●● ● ● ●

●● ●●

● ●● ●● ●●● ●● ●● ●

●●

●●

●●●● ●●●●●●●●●

●●●

●●

●●● ●● ●●●● ●●●●

●●

●●

● ●● ●●● ●●●● ●● ●

●●

● ● ●

●●● ●

●●● ●●●

●● ●

●●

●● ●

●●●●● ●●

●●●●●●

●●

●●

●●●●

● ● ●● ●●● ●●●

●●

●●

●● ●●

●●● ●●●●●●

●●●

●● ●● ●●●●●●●●●

●●●

● ●●●● ● ●●●●●

●●

●●●

●● ●●● ● ●●●●●

●●●

●●●●●●●●

●●●

●●●

●●

●●●●●●

●●

●●●

●●

● ●

●●●● ● ●

●●

●●

●●● ●

●●

●●

●●●

●● ●

●●

● ●●●

●●

●●

●● ●●

●●●

●●●

●●

●●

●● ●

●● ●

●●

●●●

●●

●● ●

●● ●●● ●

●●●

●●

●● ●● ●● ●●

● ● ●

●●●

●●●●

●● ●●● ●●●●

● ● ●●

●● ●●

●●● ●● ●●

●●

●●●

●● ●● ●●●●●●●●●

●●●

●●

●● ●●● ●

●●●● ●●●

●●

● ●

●●●● ●●●

●●●●●●

●●●

● ●

●●● ●●●● ●●●●● ●

●●

● ●●

●●●● ●●

● ●●●●●●

●●

●●

●●● ●● ● ●● ●● ●

●●●

●●●

● ●●●● ● ●●●●●

●●

●● ●

●●

●●● ●●● ●●●●●

● ●●

●●●●●● ●● ●● ●●

●●

● ●●

● ● ●● ●● ●●●

● ●● ●●

●●●●●

●●

●●●

●●●

●●

●●●●●●

●●

●●

●● ●●●

●●

●●

●●

●●●●

●●●● ●● ●●●

●●

●●

● ●●●

●●●

●● ●● ●● ●●

●●

● ●●

●●

●● ●●

●●●●● ●

●●

●●

●● ●● ●● ●●● ● ●

●●

●●

●●●● ●● ●●● ●● ●

●●●

● ● ●●●● ●● ●●●

●●●

●●

●●

●● ●●● ● ●

●● ●●

●●

●●

●●

●●●●● ●● ●● ●●●●

●●

●●

●●●●● ●

●●●●●●●

●●●

● ●

●●●●● ●●●● ●●

●●

●●●

●●

●● ●

● ●●●●●●●

●●●

● ●●

●●

●● ●●

●●●●●●●●●

●●●

● ● ●●● ● ●●●●●●●

● ●●

●● ●●● ●●●●●●

●●●

●●

● ●

●● ● ●●● ●●●●●

●●●

●●

● ●

●● ●●● ●●

●●●

●●● ●

●●

●●

●● ●●●

●●●●● ●● ● ●

●●●●

●●●

●● ●

●●●●

●●●

●●●

●● ●●

●● ●●

●●

●●●

● ● ●

●●●●

●●

● ●●●

●● ●

●●

●●●●●●

●●●

●●

●●●

●●

●●

●● ●

●● ●●

●●

●●●

●●

●●●

●●●●●

● ●●

●● ●

●●

●●●

●●●● ●●●

● ●●

●● ●

●● ●●● ●● ●● ●●

●●

●● ●

●●

● ● ●●●

●●

●●●●●

●●●

●● ● ●

●●●

●●●●

●● ●

●●

●●

●●●● ●● ●●●●●●●

●●● ●

● ● ●● ●●●●●●●●●

●●

● ●

●●●●●●●

●●●●●●

●●●

● ●●●● ●●

● ●●●●● ●

●●

●●

●●● ●● ●●●●●● ●●

●●

●●

●● ●●● ●●●●●●●● ●

●●

●●

●●●●● ●● ●●● ●

●●●

●●●

● ● ●● ●●● ●●● ●● ●

●●●

●● ●●●

●●●●● ●● ●●

●●●

●●

●● ●●●●●● ●●

●●

●●● ●

●● ●●

●● ●●●

●●

●●●●

●●

●●●●

●●

● ●●●

●●

●●

●●

●●●

● ●●

●● ●●

●●

●●●

●●

●●

●● ●

●●●●

●● ●

●●●

●● ●●

●● ●● ● ●●

●●

●●

●●

●●●●

●● ●●●●●●●

●●●

●● ●

●● ● ●●● ●●

●●●

●●●

●● ●●

● ●●●

●●●●

●●

●●●

●●

● ●●● ●

●● ●●

●●●

●● ●

●●

●●●●●●●●

●●●

0 20 40 60 80 100

020

4060

8010

0

2x INTEL Xeon E5−2670

Bandwidth Copy (% of peak)

Ban

dwid

th A

dditi

on (

% o

f pea

k)

Copy vs. Addition

●●

●●●●

●●●●●●

●●

●●

●●●

●●●

●●●

●●●

●●●

●●

●●

●●●●●●●●

●●

●●

● ●●

●● ●●

●●●●●●

●●

●●

●●●

●●

●●● ●●●●

●●

●●● ●

●●●●● ●

●●●

●●

●●● ●●●

●●

●●

●●

● ●●

●●●●

● ●● ●

●●●

●●

●●

●●●●

●●●●●● ●

●● ●

●●● ●●● ●●●● ●

●●

●●●●●

●●●

●●●

●●●

●●●●●●●●

●●

●●●

●●

● ●●

●●●●

● ●

●●●●

●●●

●●

●●● ●●●

●●

●●●

●● ●

●●

●●●●●

●●●

●● ●

●●

●●●●●●

●●●

●●●

●●●●●

●●●

●●●

●● ●

●●●●

●●

●● ●

●● ●● ●● ●

●●● ●

●●

●● ●● ●● ●●● ●

●● ●●●

●● ●●

●●● ●● ●●

●●

●●

●●●

●●●

●●●●●●●

●●

●●●

●● ●●●● ●

●●●

●●

●●

●●●●

● ●

●●●

●● ●

●●

●●●●

●●●●

●●●●● ●

●●●

●●

●● ●

● ●●●

●●●●

●●

● ●

●●● ● ●●

●●● ●

●●

●● ●

●●●● ●●

●●

●●

● ●

●●●

● ●●● ●●●●●●●

●●

●●●

● ●●●●

● ●●●●●

●●●

●●●●● ● ●

●●●

●●●

●●●●

●●●●

●●●

●●●

● ●●●●●●

●●

●●

● ●●●●●

● ●

●●

●●●

●●● ●●

●●

●●

●●●

●●

●●

● ●●●●

●●●

●●●

● ●●●

●●●

●●●

●●

●●●●

●●●

●● ●● ●

● ●

●●●●

●●●● ●

●●●● ●

●●●

●● ●●

●● ●●● ●

●●

● ● ●●

●●●●●

●●●● ●

●●●

●●

●●●

●● ●●●●

●●●●

●●

●●

●●●

●●

● ●

●●

●●●●

●●●

●●

●●●●●

●●●

●●

●●

●●

●●●●

●●●● ●●●●● ●

●●

●●

●●

●●●

● ●●●

●●●

● ●

●●●●●

● ●● ●●

●●●●

●●●

● ●●●● ● ●

●●

●●

● ●

●●

●●

●●●●

● ●●●●●●

●●

●●●

● ●●●●● ●

●●●

●●●

●●

●●

● ●●

● ●●

●●●● ●

●●

●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

● ●●●

●●●

●●●

●●●

●●●

●●●

● ●

●●●

●●

●●●● ●●

●● ●●

●●●● ●● ●●

● ●

● ● ●●

●● ●●

●● ●●● ●

●●

●●●

● ●●●●● ●

●●●

●●●

●●●

●●

●●●

●●●● ●

●●

●●●

●●●

● ●●

●●

●●●●

●●

● ●

●●●

● ●●●●●●●●

●●

●●●

●●●

●●●●

●●●●

●●

●● ●●●

●●●●●●

●●●

●●

● ●●●●

●●●●●●●●●

●●●

● ● ●●● ● ●

●●●●

●●

●●

●● ●●● ●●●●●●

●●

●●●

●● ● ●●● ●

●●●

●●●●

●●●

● ●●●

● ●

●●●●

●●●

●●

●●

●●●●

●●●

● ●●

●●●●

●●

●●

●●●

●●

●●

● ●● ●

●●

●●

●●

●●●●

●●

●●●

● ● ●●

●●●

●●●

●●●

● ●

●●●●

●●●

●● ●●

●●

● ●●●

●●

●●●● ●● ●●●

●● ●●

●●● ●●

●●●●●●

●●

● ●●

●●●●● ●●

●●

●●

● ● ●●

●●

● ●●

●●●●●●●

●●●

●●

●●●● ●●

●●●

●●

● ●

●●

●●

●●●●●●

●●

●●

●●●

●●●●●●●●

●●●

●●●●

●●

●●●

●●●●●

●●

● ●●

●● ●

●●●

●●●

● ●

●● ●

● ● ●

● ●●●●●● ●

●●

●●

●●

●●●

●●●●●●●

● ●

●●

●●●●●

●● ●●●●

●●●

●●

● ● ●

●●●● ●

●● ●

● ●

●●

●● ●●

●●●●●● ●●●

●●●●●●

●●●

●●●●

●●

●●

●●●

●●●

●●

●●● ●

●●

●●●

●●●

●●●

●●

● ●

●●●

●●

●●

●●●

●●

●●

● ●●

●●

●●●●

● ●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●● ●

●●

● ●●●● ●●● ●●

●●●

●●●

● ● ●●●●●●●

●●

●●●

●● ●●●

●●● ●

●●

●●● ●

●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

2x INTEL Xeon E5−2670

Bandwidth Copy (% of peak)

Ban

dwid

th In

ner

Pro

duct

(%

of p

eak)

Copy vs. Inner Product

●●

●●●

●●

● ●●●●●

●●●

●● ●

●●

●●

● ●●● ●●● ●● ●

●● ●

●●●

●●●●●● ●●

●●

●●

● ●● ●●

●●●●

●●●

●●●●

●●

●● ●●●●●

● ●●●●

●●●

●●● ●● ●●●●● ●

●●●

● ●●

●●● ●●● ●●● ●

●●●●

●●

●●

●●●●● ● ● ●●●●

● ●●

● ●●●

●● ●●

●●●● ● ● ● ●

●● ●●

●●● ●●●

●●●● ●● ●●

●●

●●●●

●● ●●●

●●●●

●●

●●

●●

●●●● ●

● ●● ●

●●●●

●● ●●

●● ●●●

●●●

●●

●●

● ●● ●●●●●

●●●●

●●

● ●● ●● ●●

●●●●●

●●●●

●●

● ●● ●● ●●

●● ●● ●

●●●●

●●●

●●●● ●●

●●●● ●

●●●●

●●●

●● ●● ●● ●●

● ● ●

●●● ●

●●

●●

●●

● ●● ●● ● ● ●

●● ●●●

●●●

●● ●●● ●● ●● ●●

●●

●●●●●

●●●●●●●●●

●●

●●● ●

● ●●●●●

●●●

●●

●● ●●

●●● ●●●●

●● ●

●●

●●

●●●

●● ●●●●●

●●●

●● ●●

●●●●● ●●

● ●●

●●

● ●●●

●●

●●●

●● ●

●●●

●●● ●●

●● ●● ●

● ●●●

●●

●●●●●

●●●

● ●●●

●●

●●●●●●

●●●

● ●●●

●●

●●● ●

●●●●●●●●

●●

●●●

●●● ●●●

●●

●●●

●●●

●●●●● ●●●●

●●

●●

●●●

●●

●●●●

●●

●●●

●●

● ●●

●●●●●

●●●

●●

●●● ●

● ●●●●●● ●

● ●●●

●●● ●● ●● ●

●●●

●● ●

●●●

●●

● ●● ●● ●● ●●

● ● ●

●●●●

●●

●●

● ●●●●● ● ●

●●● ●

●●

●●● ●●● ●●●●

● ● ●●●

●●

●●●●● ●● ●●●● ●

●●

● ●

●●

●● ●●●●●●●●●

●●

●●●●

●● ●●●●● ●●

●●

●●●●●

●●

●●●●

●●●

●●●

●●●

●● ●●●

●● ●

●●●

●●●●●● ●●●

●●●

●● ●

●●

●● ● ●

● ●●

●●●●

●●●

●●

●●

●●●●●●● ●

●●

●●

●●

●●

●● ●●●

●●●

● ●● ●

●●●

●● ●● ●

●●●

● ●● ●●

●●

●● ●●

●●●● ●● ●●

●●

●●

●●●●

●● ●●

● ●●

●●

●●●

●●●

●●●● ●

●●

●● ●

●●

●●●

● ●● ●●●●●

●●●

●●

●●

● ●●

●●●● ●● ●

●●●

●●

●●● ●● ●● ●● ●● ●

●●●

● ●● ●

● ●● ●●

●●● ● ●

●● ●●

●●

● ●● ●● ●●

●●● ● ●

● ● ●●

● ●●● ●

● ●●

● ●● ●

●●

●●

●● ●●●●●●

● ●●●

●●

●●● ● ●●● ●● ● ●●

●●

●●

●●●●●●

●●●

●●●

●●

● ● ●

●●

●● ●●●●●●●●

●●

●●

●●●●●●

●●●● ●●●●

●●●

●●●

●●●

●●

●●●

●● ●

●●●●●

●●

●●●●●

●●●

●●● ●

●●●

●●●●●

●●

●●

●●

●●

●●●●●

●●●

●●

●●●●●

●●●

● ●●●

●●● ●

●●●●●●● ●

● ●● ●●

●●●●●●

●●●● ●● ● ●

●●

●●

●● ●

●●

● ●●

●●●●

●●

●●

●●

●●

●● ●●●●●

●●●

●●

●●

●●● ●●

● ●● ●

●● ●

●●●●

● ●●●● ●● ●

●●●

●●

● ●● ●●●●●● ● ● ●

●●●

●●

● ●● ●● ●●●

●● ●●

●● ●

●● ●●●

● ●●

● ●●●

● ●● ●●●●●

● ● ●●

●●●

●● ●●●●●●

● ●● ●●

●●●

●●●● ●●●●●● ●

●●

● ● ●

● ●●

●●●●

●●●

●●

● ●

●●

● ●●●●

●●●●●●

●●

● ●

●●●●●

●●●●

●●●

●●

●●●

●●

● ●●● ●

●●

●● ●

●●

●●

●●●

●●●●

●●●●●

●●●

●●

●●●●

●●●●●● ●

●●●

●● ●

● ●●● ●●●●

●●●

●●

●● ●●●●● ●

● ●●●

●●

●●

●●●● ●● ●●

● ●●●

●●

●●

●● ●●●● ●●●●

●●

●●

●●

●● ●●

●● ●●

●●

●●

●● ●●

●● ● ●●● ●●●

●●

●●●●

●●●

● ●● ●● ●

●●●

● ●

●●● ●

●●●●●

●● ●

●● ●

● ●

● ●● ●●●● ●●

● ●●

●● ●

●●

●●

● ●● ●●●●

●●●

●●●

●●

●●

●●●●●

● ●●

● ●● ●

●●

●●●●●●● ●

● ●●●

●●●

●●●● ●●

●●●

● ● ● ●●

●●

● ●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

2x INTEL Xeon E5−2670

Bandwidth Copy (% of peak)

Ban

dwid

th M

atrix

−V

ecto

r (%

of p

eak)

Copy vs. Matrix−Vector Product

Page 57: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

40

Portable PerformancePortable Performance

INTEL Xeon Phi

Addition Inner Product Mat-Vec Product

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●

●●●●●●● ●

●●●●

●●

●●●●●

●●●●●

●●●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●

●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●

●●●●

●●●●●●●●●●

●●●●●

●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●

●●●●●●

●●●●●●●●

●●●●●●●

●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●

●●●●

●●●●●●●●●●

●●●●●

●●●●●

●●●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●

●●●●●●●

●●●●●

●●●●●●●●

●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●

●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●● ●●●

●●●●●●●●●●

●●●●●●●●

●●●●●

●●●

●●●●●

●●●●●●

●●●●●

●●●●●●●●

●●●●●●

●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●

●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●

●●●●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●●●

●●●●●●●●●●

●●

●●●●●●●●●●

●●●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●●●●●

●●●●●●●●

●●

●●●●●●●

●●●●●●●●●

●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●

●●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●

●●●●●●

●●●●●●●

●●●●●●●

●●●●●

●●●●●●●

●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●

●●

●●●●●●

●●●●●●●

●●●●●

●●●

●●●●●●●●●●●

●●●●●●●

●●●● ●●

●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●

●● ●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●● ●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●

●●●●●●

●●●●●●●

●●●●●●●●

●●●●

0 20 40 60 80 100

020

4060

8010

0

INTEL Xeon Phi 5110P

Bandwidth Copy (% of peak)

Ban

dwid

th A

dditi

on (

% o

f pea

k)

Copy vs. Addition

●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●●●●

●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●● ●●●●●●●

●●●●●

●●●●●

●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●

●●●●●

●●●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●● ●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●

●●●●● ●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●

●●●●●

●●●●●●●●●

●●●●●

●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●

●●●●●●●●

●●●●●●

●●●●●

●●●●●●●

●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●

●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●● ●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●● ●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●● ●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●●● ●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

INTEL Xeon Phi 5110P

Bandwidth Copy (% of peak)

Ban

dwid

th In

ner

Pro

duct

(%

of p

eak)

Copy vs. Inner Product

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●● ●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●

●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●

●●● ●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●

●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●

●●●●● ●●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●● ●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●

●●●●●●

●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●

●●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●

●●●●●●●

●●●● ●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●● ●●●●●

●●●●●●●

●●●●●●●●●●●●●●

●●●● ●

●●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●●

●●●●●●●●●●●●

●●●●●●●

●●●●●●

●●●●●●

●●●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●● ●●●●●●●●

0 20 40 60 80 100

020

4060

8010

0

INTEL Xeon Phi 5110P

Bandwidth Copy (% of peak)

Ban

dwid

th M

atrix

−V

ecto

r (%

of p

eak)

Copy vs. Matrix−Vector Product

Page 58: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

41

Portable PerformancePortable Performance

Conclusio:

Focus on fastest configurations for copy-kernel sufficient

Good choice:

Local workgroup size of 128 with 128 workgroups

Page 59: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

42

Portable PerformancePortable Performance

Matrix-Matrix Multiplication

Compute-bound

Block-decomposition to maximize cache utilization

K

K

kl

kl

ks

ksM ml ms

N

nl

ns

*=

Page 60: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

43

Portable PerformancePortable Performance

Matrix-Matrix Multiplication (Single Precision)

102

103

104

102

103

104

GF

LO

P/s

Matrix Rows/Columns

INTEL Core i7 960 - MKL 11.0.2

INTEL Core i7 960 - ViennaCL

NVIDIA GTX 470 - CUBLAS 5.0

NVIDIA GTX 470 - ViennaCL

AMD HD7970 - clAmdBlas 1.10

AMD HD7970 - ViennaCL

Ph. Tillet et al.: HotPar’13

Page 61: An Introduction to OpenCL for Scientific Computing - … · An Introduction to OpenCL for Scientific Computing ... Dot Product z y a. 22 ... Comparison with CUDA and OpenACC OpenACC

44

SummarySummary

OpenCL

Program CPUs and accelerators (GPUs, MIC, etc.) from different vendors

Clean integration into existing projects

Not a silver bullet

Recommendations

Program with data locality and data movement in mind

Compilers cannot fix bad programming

There is no free lunch

Convenient Use of OpenCL

Use libraries built on top of OpenCL

Tue, July 15, 9:00am:Lessons Learned in Developing the Linear Algebra Library ViennaCL