PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning


Description

PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning at the AMD Developer Summit (APU13) November 11-13, 2013.

Transcript of PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Page 1: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Adapting Languages for Parallel Processing on GPUs

Neil Henning – Technology Lead

Neil Henning [email protected]

Page 2: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Agenda

● Introduction
● Current landscape
● What is wrong with the current landscape
● How to enable your language on GPUs
● Developing tools for GPUs

Page 3: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Introduction


Page 4: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Introduction – who am I?

● Five years in the industry
● Spent all of that using SPUs, GPUs, vector units & DSPs
● Last two years focused on open standards (mostly OpenCL)
● Passionate about making compute easy

Page 5: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Introduction – who are we?

● GPU Compiler Experts based out of Edinburgh, Scotland
● 35 employees working on contracts, R&D and internal tech

Page 6: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape


Page 7: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape

● Languages – CUDA, RenderScript, C++AMP & OpenCL

● Targets – GPU (mobile & desktop), CPU (scalar & vector), DSPs, FPGAs

● Concerns – performance, power, precision, parallelism & portability


Page 8: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape - CUDA

● CUDA is incredibly established
● First major GPU compute approach to market
● Huge bank of tools, libraries and knowledge
● Used in banking, medical imaging, game asset creation, and many, many more uses!

__global__ void kernel(char * a, char * b) {
  a[blockIdx.x] = b[blockIdx.x];
}

char in[SIZE], out[SIZE];
char * cIn, * cOut;
cudaMalloc((void **)&cIn, SIZE);
cudaMalloc((void **)&cOut, SIZE);
cudaMemcpy(cIn, in, SIZE, cudaMemcpyHostToDevice);
kernel<<<SIZE, 1>>>(cOut, cIn);
cudaMemcpy(out, cOut, SIZE, cudaMemcpyDeviceToHost);
cudaFree(cIn);
cudaFree(cOut);

● Using CUDA means abandoning compute on majority of devices
● Really only had uptake in offline processing
● Standard isn’t open, little room (or enthusiasm) for other vendors to implement

Page 9: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape - RenderScript

● Intelligent runtime load balances kernels

● Creates Java classes to interface with kernels

● Focused on performance portability


#pragma version(1)
#pragma rs java_package_name(foo)

rs_allocation gIn;
rs_allocation gOut;
rs_script gScript;

void root(const char * in, char * out, const void * usr, uint32_t x, uint32_t y) {
  *out = *in;
}

void filter() {
  rsForEach(gScript, gIn, gOut, NULL);
}

Context ctxt = /* … */;
RenderScript rs = RenderScript.create(ctxt);
ScriptC_foo script = new ScriptC_foo(rs, getResources(), R.raw.foo);
Allocation in = Allocation.createSized(rs, Element.I8(rs), SIZE);
Allocation out = Allocation.createSized(rs, Element.I8(rs), SIZE);
script.set_gIn(in);
script.set_gOut(out);
script.set_gScript(script);
script.invoke_filter();

● Only on Android

● Limited documentation & shortage of examples

● No real idea of feature roadmap

Page 10: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape – C++AMP

● Very well thought out single source approach
● Lovely use of C++ templates to capture type information, array dimensions
● Great use of C++11 lambdas for capturing kernel intent


int in[SIZE], out[SIZE];
array_view<const int, 1> aIn(SIZE, in);
array_view<int, 1> aOut(SIZE, out);
aOut.discard_data();
parallel_for_each(aOut.extent, [=](index<1> idx) restrict(amp) {
  aOut[idx] = aIn[idx];
});
// can access aOut[…] like normal

● Part of target community is really C++11 averse, need convincing
● Limited low-level support
● Initial interest by community faded fast
● Xbox One will support C++AMP – watch this space

Page 11: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape - OpenCL

● Open standard with many contributors

● API puts control in developer hands

● Support on lots of heterogeneous platforms – not just GPUs!


void kernel foo(global int * a, global int * b) {
  int idx = get_global_id(0);
  a[idx] = b[idx];
}

// device, context, queue, in, out already created
cl_program program = clCreateProgramWithSource(context, 1, &fooAsStr, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "foo", NULL);
// set kernel arguments
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 0, NULL, NULL);

● API is verbose, very very verbose!

● Steep learning curve for new developers

● Have to support diverse range of application types

Page 12: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape


Modern systems have many compute-capable devices in them

Not unlike the fictitious system shown above!

Page 13: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape


Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use

Mostly a fallback target for compute currently

Page 14: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape



Vector units are supported if kernel has vector types

Can auto-vectorize user kernels, as vector units harder for ‘normal’ programmers to target

Page 15: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape



Digital Signal Processors (DSPs) are a future target for the compute market

Can make no assumptions as to what DSPs ‘look’ like

Page 16: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape



GPUs are the reason we have compute in the first place

GPUs do not forgive poor code like a CPU or even a DSP could, require large arrays of work to utilize

Page 17: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Current Landscape


● Have to weigh up many competing concerns for languages

● Platform, operating system, device type, battery life, use case

Page 18: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


Page 19: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape

● Compute approaches are not available on all device and OS combinations
● No CUDA on AMD, RenderScript on iOS or C++AMP on Linux
● Have to support offline precise compute & time-bound online compute
● Such divergent targets/use cases/device types are problematic!


Page 20: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


void foo(int * a, int * b, int * count) {
  for(int idx = 0; idx < *(count); ++idx) {
    a[idx] = 42 * b[idx];
  }
}

● What if loop count is always multiple of four?

Page 21: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape



● Can unroll the loop four times!

void foo(int * a, int * b, int * count) {
  for(int idx = 0; idx < *(count); idx += 4) {
    a[idx + 0] = 42 * b[idx + 0];
    a[idx + 1] = 42 * b[idx + 1];
    a[idx + 2] = 42 * b[idx + 2];
    a[idx + 3] = 42 * b[idx + 3];
  }
}

Page 22: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


● What if pointers a & b are sixteen byte aligned?


Page 23: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


● Can vectorize the loop body!

void foo(int * a, int * b, int * count) {
  int vecCount = *(count) / 4;
  int4 * vA = (int4 *)a;
  int4 * vB = (int4 *)b;
  for(int idx = 0; idx < vecCount; ++idx) {
    vA[idx] = vB[idx] * (int4)42;
  }
}

Page 24: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


● Why does my code look so radically different now?


Page 25: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


● Current languages force drastic developer interventions


Page 26: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape


● Existing languages (mostly) force developers to do coding wizardry that is unnecessary
● Also no real feedback to developer as ‘main’ compute target has highly secretive ISAs
● Don’t want to force vendors to reveal secrets, but do want ability to influence kernel code generation

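As a sketch of the alternative being argued for (ours, not from the talk): keep the scalar loop intact and hand the compiler the same facts the hand-vectorized version encodes. `__restrict` is a GCC/Clang extension, and the multiple-of-four and alignment guarantees become caller contracts instead of rewrites.

```cpp
#include <cassert>

// Scalar loop left readable; aliasing ruled out via __restrict (a GCC/Clang
// extension) and count passed by value, so an optimizing compiler is free
// to do the unrolling and vectorization itself under the stated contracts.
void foo(int *__restrict a, const int *__restrict b, int count) {
  // Contract (not checked here): count is a multiple of four and both
  // pointers are sixteen-byte aligned.
  for (int idx = 0; idx < count; ++idx)
    a[idx] = 42 * b[idx];
}
```

Whether the compiler actually vectorizes this depends on the target and flags; the point is that the source stays the obvious scalar loop.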

Page 27: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape

● Rely on vendors to provide tools to aid development

● Debuggers, profilers, static analysis all increasingly required

● Libraries can vastly decrease development time

● Rely solely on vendors to provide all these complicated pieces


Page 28: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape

● Vendors already have lots of targets to support

● Every generation of devices needs to be tested for conformance

● Need to support compilers, graphics, compute, tools, list goes on!

● Why should the vendor be the only one taking the burden?


Page 29: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape

● No one can agree on what is the ‘best’ approach

● Personal preference of developer/organization sways opinions

● Why not allow Lisp on a GPU? Lua on a DSP?

● Vendor doesn’t need extra headache of supporting these niche use cases


Page 30: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

What is wrong with the current landscape

● My pitch – let the community support compute standards
● Take the approach of LLVM & Clang
● Vendors still have to support the lower-level standard on their hardware
● But it allows the community to support & innovate


Page 31: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs


Page 32: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● First step – be able to compile language to a binary

● Can’t output real binary though

● Vendor doesn’t want to expose ISA

● Developer wants portability of compiled kernels


Page 33: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Need to use an Intermediate Representation (IR)

● Two approaches in development for this!

● HSA Intermediate Language (HSAIL)

● OpenCL Standard Portable Intermediate Representation (SPIR)


Page 34: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Language -> LLVM IR -> HSAIL
● Low level mapping onto hardware, more of a virtual ISA than an IR
● HSAIL heavily in development

● Language -> LLVM IR -> SPIR
● Then pass SPIR to OpenCL runtime as binary
● Execute like normal OpenCL C language kernel
● Provisional specification available!

Page 35: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● HSA will provide a low-level runtime to interface between HSA compiled binaries and the OS
● HSAIL is being standardized and ratified
● Existing JIT’ed languages are potential targets

● OpenCL SPIR will require a SPIR compliant OpenCL implementation as target
● Can compile using LLVM, then use clCreateProgramWithBinary, passing SPIR options

Page 36: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● At present, SPIR is the only target we can investigate

● Intel has OpenCL drivers with provisional SPIR support

● Can use Clang -> LLVM -> SPIR, then use Intel’s OpenCL to consume SPIR

● Can take code that compiles to LLVM and run it on OpenCL


Page 37: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Various steps to getting your language working on GPUs with SPIR
● We’ll use Intel’s OpenCL SDK with provisional SPIR support:

1. Create a test harness to load a SPIR binary
2. Create a simple kernel using Intel’s SPIR compiler on host
3. Create a simple kernel using tip-of-tree Clang (language OpenCL) targeting SPIR
4. Try other languages that compile to LLVM with SPIR target


Page 38: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

// some SPIR bitcode file
const unsigned char spir_bc[spir_bc_length];

// already initialized platform, device & context for a SPIR compliant device
cl_platform_id platform = ... ;
cl_device_id device = ... ;
cl_context context = ... ;

// create our program with our SPIR bitcode file
// (clCreateProgramWithBinary takes an array of binary pointers)
const unsigned char * binary = spir_bc;
cl_program program = clCreateProgramWithBinary(
  context, 1, &device, &spir_bc_length, &binary, NULL, NULL);

// build, passing options telling the compiler the language is SPIR,
// and the SPIR standard we are using
clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL);

Page 39: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

// already initialized memory buffers for our context
cl_mem in_mem = ... ;
cl_mem out_mem = ... ;

// assume our kernel function from the SPIR kernel was called foo
cl_kernel kernel = clCreateKernel(program, "foo", NULL);

// assume our kernel has one read buffer as first argument,
// and one write buffer as second
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&in_mem);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&out_mem);

Page 40: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

// already initialized command queue
cl_command_queue queue = ... ;

cl_event write_event, run_event;

clEnqueueWriteBuffer(queue, in_mem, CL_FALSE, 0, BUFFER_SIZE, &read_payload, 0, NULL, &write_event);

const size_t size = BUFFER_SIZE / sizeof(cl_int);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 1, &write_event, &run_event);

clEnqueueReadBuffer(queue, out_mem, CL_TRUE, 0, BUFFER_SIZE, &result_payload, 1, &run_event, NULL);

Page 41: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Now, create a simple OpenCL kernel

void kernel foo(global int * in, global int * out) {
  out[get_global_id(0)] = in[get_global_id(0)];
}

● And use Intel’s command line (or GUI!) tool to build

ioc32 -cmd=build -input foo.cl -spir32=foo.bc

Page 42: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Next we point the buffer for our SPIR kernel at the generated SPIR kernel

● And it fails…?

● Turns out Intel’s OpenCL runtime doesn’t like us telling them they are building SPIR!
● Simply remove "-x spir -spir-std=1.2" from the build options and voila!


Page 43: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Next step – use tip-of-tree Clang to build our foo.cl kernel
● Compiles ok, but when we run it, it fails…?
● So a Clang-generated SPIR bitcode file could very well not work
● We’ll take a look at the readable IR for the Intel & Clang compiled kernels

clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.cl -o foo.bc

Page 44: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Clang output

; ModuleID = 'ex.cl'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"

; Function Attrs: nounwind
define void @foo(i32 addrspace(1)* nocapture readonly %a, i32 addrspace(1)* nocapture %b) #0 {
entry:
  %0 = load i32 addrspace(1)* %a, align 4, !tbaa !2
  store i32 %0, i32 addrspace(1)* %b, align 4, !tbaa !2
  ret void
}

attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!opencl.kernels = !{!0}
!llvm.ident = !{!1}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo}
!1 = metadata !{metadata !"clang version 3.4 (trunk)"}
!2 = metadata !{metadata !3, metadata !3, i64 0}
!3 = metadata !{metadata !"int", metadata !4, i64 0}
!4 = metadata !{metadata !"omnipotent char", metadata !5, i64 0}
!5 = metadata !{metadata !"Simple C/C++ TBAA"}

Page 45: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● IOC output

; ModuleID = 'ex.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"

define spir_kernel void @foo(i32 addrspace(1)* %a, i32 addrspace(1)* %b) nounwind {
  %1 = alloca i32 addrspace(1)*, align 4
  %2 = alloca i32 addrspace(1)*, align 4
  store i32 addrspace(1)* %a, i32 addrspace(1)** %1, align 4
  store i32 addrspace(1)* %b, i32 addrspace(1)** %2, align 4
  %3 = load i32 addrspace(1)** %1, align 4
  %4 = load i32 addrspace(1)* %3, align 4
  %5 = load i32 addrspace(1)** %2, align 4
  store i32 %4, i32 addrspace(1)* %5, align 4
  ret void
}

!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}


Page 47: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● So the metadata is different!
● We could fix Clang to produce the right metadata…?
● Or just hack around!
● Let’s use Intel’s compiler to generate a stub function
● Then we can use an extern function defined in our Clang module!


Page 48: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

extern int doSomething(int a);

void kernel foo(global int * in, global int * out) {
  int id = get_global_id(0);
  out[id] = doSomething(in[id]);
}

int doSomething(int a) {
  return a;
}

Page 49: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● And it fails…?

● Intel’s compiler doesn’t like extern functions!

● We’ve already bodged it thus far…
● So let’s continue!

int __attribute__((weak)) doSomething(int a) {}

void kernel foo(global int * in, global int * out) {
  int id = get_global_id(0);
  out[id] = doSomething(in[id]);
}

Page 50: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● More than a little nasty…

● Relies on Clang extension to declare function weak within OpenCL

● Relies on Intel using Clang and allowing extension

● But it works!

● Can build both the Intel stub code & the Clang actual code

● Then use llvm-link to pull them together!
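As concrete commands, the stub-and-link flow might look like the following. The file names are made up for illustration; the ioc32 and clang invocation shapes are the ones from earlier slides, and only llvm-link's basic usage is taken from LLVM itself.

```shell
# Build the Intel stub (weak doSomething plus the kernel) to SPIR bitcode
ioc32 -cmd=build -input stub.cl -spir32=stub.bc

# Build the real doSomething with tip-of-tree Clang, targeting SPIR
clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc impl.cl -o impl.bc

# Merge the two modules; the strong definition wins over the weak stub
llvm-link stub.bc impl.bc -o combined.bc
```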


Page 51: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● So now we can compile two OpenCL kernels, link them together, and run the result
● What is next? We want to enable your language!
● What about using Clang, but with a different language?
● C & C++ come to mind!


Page 52: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Use a simple C file

int doSomething(int a) {
  return a;
}

● And use Clang to compile it

clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.c -o foo.bc

Page 53: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Or a simple C++ file!

extern "C" int doSomething(int a);

template<typename T>
T templatedSomething(const T t) {
  return t;
}

int doSomething(int a) {
  return templatedSomething(a);
}

Page 54: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Let’s have some real C++ code
● Use features that OpenCL doesn’t provide us
● We’ll do a matrix multiplication in C++
● Use classes, constructors, templates


Page 55: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

typedef float __attribute__((ext_vector_type(4))) float4;
typedef float __attribute__((ext_vector_type(16))) float16;

float __attribute__((overloadable)) dot(float4 a, float4 b);

template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
class Matrix {
  typedef T __attribute__((ext_vector_type(WIDTH))) RowType;
  RowType rows[HEIGHT];
public:
  Matrix() {}

  template<typename U> Matrix(const U & u) {
    __builtin_memcpy(&rows, &u, sizeof(U));
  }

  RowType & operator[](const unsigned int index) {
    return rows[index];
  }

  const RowType & operator[](const unsigned int index) const {
    return rows[index];
  }
};

Page 56: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
Matrix<T, WIDTH, HEIGHT> operator *(const Matrix<T, WIDTH, HEIGHT> & a, const Matrix<T, WIDTH, HEIGHT> & b) {
  Matrix<T, HEIGHT, WIDTH> bShuffled;

  for(unsigned int h = 0; h < HEIGHT; h++)
    for(unsigned int w = 0; w < WIDTH; w++)
      bShuffled[w][h] = b[h][w];

  Matrix<T, WIDTH, HEIGHT> result;

  for(unsigned int h = 0; h < HEIGHT; h++)
    for(unsigned int w = 0; w < WIDTH; w++)
      result[h][w] = dot(a[h], bShuffled[w]);

  return result;
}

Page 57: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

extern "C" float16 doSomething(float16 a, float16 b);

float16 doSomething(float16 a, float16 b) {
  Matrix<float, 4, 4> matA(a);
  Matrix<float, 4, 4> matB(b);

  Matrix<float, 4, 4> mul = matA * matB;

  float16 result = (float16)0;
  result.s0123 = mul[0];
  result.s4567 = mul[1];
  result.s89ab = mul[2];
  result.scdef = mul[3];
  return result;
}

Page 58: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● And when we run it…
● Success!

ex5.vcxproj -> E:\AMDDeveloperSummit2013\build\Example5\Debug\ex5.exe
Found 2 platforms!
Choosing vendor 'Intel(R) Corporation'!
Found 1 devices!
SPIR file length '3948' bytes!
[  0.0,  1.0,  2.0,  3.0] * [ 16.0, 15.0, 14.0, 13.0] = [ 40.0,  34.0,  28.0,  22.0]
[  4.0,  5.0,  6.0,  7.0] * [ 12.0, 11.0, 10.0,  9.0] = [200.0, 178.0, 156.0, 134.0]
[  8.0,  9.0, 10.0, 11.0] * [  8.0,  7.0,  6.0,  5.0] = [360.0, 322.0, 284.0, 246.0]
[ 12.0, 13.0, 14.0, 15.0] * [  4.0,  3.0,  2.0,  1.0] = [520.0, 466.0, 412.0, 358.0]
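As a sanity check on that output, here is a plain C++ reference multiply (no OpenCL involved, and not part of the talk's example) for the same row-major 4x4 matrices, A holding 0..15 and B holding 16..1.

```cpp
#include <cassert>

// Reference 4x4 multiply, row-major float[16]: r = a * b.
void refMul(const float *a, const float *b, float *r) {
  for (int row = 0; row < 4; ++row)
    for (int col = 0; col < 4; ++col) {
      float acc = 0.0f;
      for (int k = 0; k < 4; ++k)
        acc += a[row * 4 + k] * b[k * 4 + col];
      r[row * 4 + col] = acc;
    }
}
```

Feeding it A = 0..15 and B = 16..1 reproduces the first and last rows printed above (40, 34, 28, 22 and 520, 466, 412, 358).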

Page 59: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● The least you need to target a GPU:
● Generate correct LLVM IR with the SPIR metadata (as shown in the IOC output earlier)
● Or at least generate LLVM IR and use the approach we used to combine Clang and IOC generated kernels

Page 60: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● Porting C/C++ libraries to SPIR requires a little more work

● The data pointed to by ‘a’ will by default be put in the private address space

● But a straight conversion to SPIR needs all data in global address space

● Means that any porting of existing code could be quite intrusive


int foo(int * a) {
  return *a;
}
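One way to limit that intrusion is to hide the qualifier behind a macro so the same source builds both as ordinary host code and as SPIR-targeting code. This is our own sketch; the `GLOBAL_AS` name and the `__SPIR__` guard are assumptions, so check what your Clang build actually predefines.

```cpp
#include <cassert>

// GLOBAL_AS expands to the global address-space qualifier only when building
// for a SPIR-style target (the guard macro here is hypothetical); on a
// normal host build it expands to nothing, so the code stays plain C++.
#if defined(__SPIR__)
#define GLOBAL_AS __attribute__((address_space(1)))
#else
#define GLOBAL_AS
#endif

int foo(GLOBAL_AS int *a) { return *a; }
```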

Page 61: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

How to enable your language on GPUs

● To target your language at GPUs

● Need to deal with distinct address spaces

● Need to be able to segregate work into parallel chunks

● Have to ban certain features that don’t work with compute

● Language could also provide an API onto OpenCL SPIR builtins

● But with OpenCL SPIR it is now possible to make any language work on a GPU!


Page 62: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs


Page 63: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs

● Tools increasingly required to support development

● Even having printf (which OpenCL 1.2 added) is novel!

● But with increasingly complex code better tools needed

● Main three are debuggers, profilers and compiler-tools


Page 64: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs

● Debuggers for compute are difficult for non-vendor to develop

● Codeplay has developed such tools on top of compute standards

● Problem is bedrock for these tools can change at any time

● Hard to beat vendor-owned approach that has lower-level access


Page 65: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs

● Codeplay are pushing hard for HSA to have features that aid tool development
● Debuggers are much easier with instruction support, debug info, change registers, call stacks
● OpenCL SPIR is harder to create a debugger for without vendor support
● Can we standardize a way to debug OpenCL SPIR, or allow debugging via emulation of SPIR?

Page 66: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs

● Profilers require a superset of the debugger feature-set
● Need to be able to trap kernels at defined points
● Accurate timings are the only other requirement beyond debugger support
● More fun when we go beyond performance, and measure power


Page 67: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs

● HSA and OpenCL SPIR both good profiler targets

● Could split SPIR kernels into profiling sections

● Then use existing timing information in OpenCL

● HSA will only require debugger features we are pushing for


Page 68: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Developing tools for GPUs

● Compiler tools consist of optimizers and analysis

● Both HSA and OpenCL SPIR being based on LLVM enable this!

● We as compiler experts can aid existing runtimes

● You as developers can add optimizations & analyse your kernels!


Page 69: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Conclusion


Page 70: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Conclusion

● With the rise of open standards, compute is increasingly easy

● With HSA & OpenCL SPIR hardware is finally open to us!

● Just need standards to ratify, mature & be available on hardware!

● Next big push into compute is upon us


Page 71: PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

Questions?

Neil Henning [email protected]

Can also catch me on twitter @sheredom