Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a...

31
Current State of Open Source GPGPU Tom Stellard XDC 2017 September 20, 2017

Transcript of Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a...

Page 1: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

Current State of Open Source GPGPU

Tom StellardXDC 2017September 20, 2017

Page 2: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

2

const char * program_src ="__kernel\n""void pi(__global float * out) \n""{\n"" out[0] = 3.14159f;\n""}\n";

int main(int argc, char ** argv) {

// Insert device setup code here.

program = clCreateProgramWithSource(context,1,&program_src,NULL,&error); error = clBuildProgram(program,1,&device_id,NULL,NULL,NULL); // Insert buffer allocation and program execution code here.

}

const char * program_src ="__kernel\n""void pi(__global float * out) \n""{\n"" out[0] = 3.14159f;\n""}\n";

int main(int argc, char ** argv) {

// Insert device setup code here.

program = clCreateProgramWithSource(context,1,&program_src,NULL,&error); error = clBuildProgram(program,1,&device_id,NULL,NULL,NULL); // Insert buffer allocation and program execution code here.

}

OpenCL

Page 3: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

3

Clover

● Clover – hardware independent OpenCL API implementation.

● Gallium Drivers – Hardware dependent userspace GPU drivers.

● libClang – OpenCL C language frontend.

● libLLVM – Hardware dependent codegen.

● libclc – LLVM IR bitcode library with device builtin functions (e.g. sin, cos, min, max).

GalliumDrivers

CloverlibClanglibLLVM

OpenCLProgramlibclc

OpenCL C

GPU Code/IR

Mesa

LLVM

Overview

Page 4: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

4

Clover

● OpenCL 1.1 Supported.

● Missing OpenCL 1.2 Functions:

● clEnqueueMigrateMemObjects(), clGetKernelArgInfo(), clEnqueueFillBuffer(), clEnqueueFillImage()

● OpenCL C 1.2 supported via clang.

API Status

GalliumDrivers

CloverlibClanglibLLVM

OpenCLProgramlibclc

OpenCL C

GPU Code/IR

Mesa

LLVM

Page 5: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

5

Clover

● Missing image support in LLVM/libclc.

● Open Source CTS making development easier.

● cl_khr_fp16 almost done.

● Benefiting from shared compiler with ROCm.

Gallium - AMD

GalliumDrivers

CloverlibClanglibLLVM

OpenCLProgramlibclc

OpenCL C

GPU Code/IR

Mesa

LLVM

Page 6: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

6

Clover

● Work in progress OpenCL C support.

● OpenCL C → SPIR-V → NV machine code.

● Compute API implementation mostly in place.

Gallium - nouveau

GalliumDrivers

CloverlibClanglibLLVM

OpenCLProgramlibclc

OpenCL C

GPU Code/IR

Mesa

LLVM

Page 7: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

7

libclcStatus

● nvptx – basic support.

● amdgpu - well supported.

● Missing OpenCL 1.2 functions: maxmag, minmag, nan, powr, remainder, remquo, rootn, tanpi, half_cos, half_divide, half_exp, half_exp2, half_exp10, half_log, half_log2, half_log10, half_powr, half_recip, half_rsqrt, half_sin, half_sqrt, native_log10

● Missing Image functions. GalliumDrivers

CloverlibClanglibLLVM

OpenCLProgramlibclc

OpenCL C

GPU Code/IR

Mesa

LLVM

Page 8: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

8

Pocl

● Supports LLVM 5.0.

● Approaching OpenCL 1.2 completeness.

● New Cuda backend.

Pocl Devices

PocllibClanglibLLVM

OpenCLProgram

PoclBuiltin Lib

OpenCL C

GPU Code/IR

PoclLLVM

HSA TCECuda

Status

Pocl Devices

PocllibClanglibLLVM

OpenCLProgram

PoclBuiltin Lib

OpenCL C

GPU Code/IR

PoclLLVM

HSA TCECuda

Page 9: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

9

Beignet

● Supports LLVM 5.0.

● OpenCL 2.0.

● Passed conformance for 1.2 in 2015.Beignet

libClang

OpenCLProgram

BeignetBuiltin Lib

OpenCL C

LLVM IR

BeignetLLVM

Status

Page 10: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

10

ROCm OpenCLStatus

● Supports ROCm compatible AMD hardware.

● OpenCL 1.2 API with OpenCL C 2.0.

● Function call support in progress.

● Compiler optimizations for LLVM AMDGPU backend.

ROCmRuntime

ROCmOpenCL

libClanglibLLVM

OpenCLProgram

ROCmDevice Lib

OpenCL C

GPU Code

ROCmLLVM

Page 11: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

11

Clover-GSOC11Status

● Development started in 2009.

● Goal was OpenCL over gallium API.

● Picked up as Google Summer of Code Project in 2011.

● Gallium support dropped.

● Focused on supporting CPU targets.

● Development stopped after 2011.

● Inspired clover in mesa, but the mesa code is a complete rewrite.

Inspiration for Clover

CloverlibClanglibLLVM

OpenCLProgram

CloverBuiltin Lib

OpenCL C

CPU Code

Clover-GSOC11LLVM

Page 12: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

12

TI-OpenCLStatus

OpenCL for DSP

● Based on Clover-GSOC2011.

● Still under active development.

● Borrowed device library code from libclc/pocl.

● OpenCL 1.1

● Some CPU support.

TI-OpenCL

libClanglibLLVM

OpenCLProgram

TI-OpenCLBuiltin Lib

OpenCL C

DSP Code ?

Clover-GSOC11LLVM

Page 13: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

13

ShamrockStatus

Fork of TI-OpenCL

● Improved CPU support in TI-OpenCL.

● CPU support for OpenCL 1.2.

● No commits for one year.Shamrock

libClanglibLLVM

OpenCLProgram

ShamrockBuiltin Lib

OpenCL C

CPU Code

Clover-GSOC11LLVM

Page 14: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

14

OpenCL Overview

● 4 Different device library implementations.

● 3 Implementations supporting various subsets of AMD hardware.

● Beignet only conformant implementation (1.2).

Project Version Hardware

Clover 1.1 AMD

Pocl 1.2 CPU, NVIDIA1, AMD 2 TCE/TTA

Beignet 2.0 Intel

ROCm OpenCL 1.2 AMD3

Lots of fragmentation

1 Requires proprietary drivers2 HSA Compatible Hardware3 ROCm Compatible Hardware

Page 15: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

15

OpenCL Overview

● Missing features lead to new implementation and prevent consolidation:

● Clover missing CPU support.

● POCL missing interface to open GPU drivers.

● Clover/Gallium not supporting Intel GPUs.

● Well-tested closed implementation open sourced

● ROCm OpenCL

Project Version Hardware

Clover 1.1 AMD

Pocl 1.2 CPU, NVIDIA1, AMD 2 TCE/TTA

Beignet 2.0 Intel

ROCm OpenCL 1.2 AMD3

Why so much fragmentation ?

1 Requires proprietary drivers2 HSA Compatible Hardware3 ROCm Compatible Hardware

Page 16: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

16

OpenCL Overview

● OpenCL ICD makes it easy for multiple implementations to co-exist.

● Leveraging LLVM/Clang for OpenCL C support makes new implementations easier.

Project Version Hardware

Clover 1.1 AMD

Pocl 1.2 CPU, NVIDIA1, AMD 2 TCE/TTA

Beignet 2.0 Intel

ROCm OpenCL 1.2 AMD3

Other incentives for fragmentation:

1 Requires proprietary drivers2 HSA Compatible Hardware3 ROCm Compatible Hardware

Page 17: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

17

OpenCL Vendor Status

● Current OpenCL version is 2.2, spec released May 2017.

● OpenCL 1.2 spec was released in November 2011.

● NVIDIA achieved 1.2 conformance in May 2015.

● Most SOC vendors support 1.2.

● Only ARM and Qualcomm only SOC vendors to support 2.0.

● 1 successful conformance submission in 2017:

● ARM Mali OpenCL 2.0

Vendor Highest Conformant Version

Intel 2.1

AMD 2.0

NVIDIA 1.2

Vendor adoption has slowed for recent versions.

Page 18: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

18

OpenCL Implementation Rates

AMD NVIDIA Intel GPU

-60

-40

-20

0

20

40

60

1.01.11.22.02.12.2

Mo

nth

s to

Co

nfo

rma

nce

Page 19: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

19

OpenCLOpen Source Outlook

● Slow vendor adoption has been a positive:

● Open source implementations not that far behind.

● Fragmentation still an issue going forward.

● Incentive for implementing OpenCL 2.0+ may be low.

OpenCL 1.0 → 2.2

Page 20: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

20

OpenCLFuture Direction

‘OpenCL Next’ discussed at IWOCL 2017

● Convergence of OpenCL and Vulkan API.

● Not clear what this means.

● Fine grained feature capabilities.

● API subsets for different devices / use cases.

● More language flexibility.

Page 21: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

21

OpenCLOpen Source Outlook

‘OpenCL Next’ could be good for Open Source implementations if:

● Full API implementation not required for conformance.

● Shared effort with Vulkan.

● Compiler frontends not part of API.

Page 22: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

22

SPIR-V

● Khronos SPIR-V LLVM

● Bi-directional SPIR-V to LLVM IR converter.

● Fork of LLVM 3.8.

● clspv

● OpenCL C to SPIR-V.

● Stand-alone project.

● Uses libclang to compile OpenCL C to LLVM IR.

● Converts LLVM-IR to SPIR-V.

Status

Translators to/from SPIR-V

OpCapbility Shader OpMemoryModel Logical Simple OpEntryPoint GLCompute %3 "main" OpExecutionMode %3 LocalSize 64 64 1%1 = OpTypeVoid%2 = OpTypeFunction %1%3 = OpFunction %1 None %2%4 = OpLabel OpReturn OpFunctionEnd

OpCapbility Shader OpMemoryModel Logical Simple OpEntryPoint GLCompute %3 "main" OpExecutionMode %3 LocalSize 64 64 1%1 = OpTypeVoid%2 = OpTypeFunction %1%3 = OpFunction %1 None %2%4 = OpLabel OpReturn OpFunctionEnd

https://github.com/KhronosGroup/SPIRV-Tools/blob/master/syntax.md

Page 23: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

23

ROCmstatus

GPU Runtime / Toolchain for AMD hardware

● Full stack: Kernel Drivers, Userspace Drivers, toolchain.

● Completely Open Source.

● Low-level compute API for AMD.

● Clang based toolchain:

● Direct-to-ISA compilation.

● Assembler / Disassembler.

● HCC compiler for C++AMP/HCC language.

ROCmRuntime

GPGPUAPI

libClanglibLLVM

ROCmDevice Lib

High-level Language

GPU Code

ROCm

LLVM

Page 24: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

24

OpenMPStatus

Open Source Runtimes

● GCC Runtime (libgomp).

● Clang Runtime (libomp).

● Both runtimes have Cuda backends.

#include <omp.h>#include <stdio.h>#include <stdlib.h>

int main (int argc, char *argv[]) {int nthreads, tid;

/* Fork a team of threads giving them their own copies of variables */#pragma omp parallel private(nthreads, tid) {

/* Obtain thread number */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid);

/* Only master thread does this */ if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } } /* All threads join master thread and disband */}

#include <omp.h>#include <stdio.h>#include <stdlib.h>

int main (int argc, char *argv[]) {int nthreads, tid;

/* Fork a team of threads giving them their own copies of variables */#pragma omp parallel private(nthreads, tid) {

/* Obtain thread number */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid);

/* Only master thread does this */ if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } } /* All threads join master thread and disband */}

Page 25: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

25

Cuda

● Low-level: Driver API.

● Higher-level: Runtime API.

● Compiler inserts API calls into program for launching kernel.

Overview

Single-source GPGPU

#include <iostream>

__global__ void axpy(float a, float* x, float* y) { y[threadIdx.x] = a * x[threadIdx.x];}

int main(int argc, char* argv[]) { const int kDataLen = 4;

float a = 2.0f; float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; float host_y[kDataLen];

// Copy input data to device. float *device_x, *device_y; cudaMalloc(&device_x, kDataLen * sizeof(float)); cudaMalloc(&device_y, kDataLen * sizeof(float)); cudaMemcpy(device_x, host_x, kDataLen * sizeof(float), cudaMemcpyHostToDevice);

// Launch the kernel. axpy<<<1, kDataLen>>>(a, device_x, device_y);

// Copy output data to host. cudaDeviceSynchronize(); cudaMemcpy(host_y, device_y, kDataLen * sizeof(float), cudaMemcpyDeviceToHost);

// Print the results. for (int i = 0; i < kDataLen; ++i) { std::cout << "y[" << i << "] = " << host_y[i] << "\n"; }

cudaDeviceReset(); return 0;}

#include <iostream>

__global__ void axpy(float a, float* x, float* y) { y[threadIdx.x] = a * x[threadIdx.x];}

int main(int argc, char* argv[]) { const int kDataLen = 4;

float a = 2.0f; float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; float host_y[kDataLen];

// Copy input data to device. float *device_x, *device_y; cudaMalloc(&device_x, kDataLen * sizeof(float)); cudaMalloc(&device_y, kDataLen * sizeof(float)); cudaMemcpy(device_x, host_x, kDataLen * sizeof(float), cudaMemcpyHostToDevice);

// Launch the kernel. axpy<<<1, kDataLen>>>(a, device_x, device_y);

// Copy output data to host. cudaDeviceSynchronize(); cudaMemcpy(host_y, device_y, kDataLen * sizeof(float), cudaMemcpyDeviceToHost);

// Print the results. for (int i = 0; i < kDataLen; ++i) { std::cout << "y[" << i << "] = " << host_y[i] << "\n"; }

cudaDeviceReset(); return 0;}

Page 26: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

26

Cuda

● gdev

● Kernel module, userspace API.

● Last public commit 3 years ago.

● Clang CUDA frontend

● AMD HIP

Status

Open Source Work

#include <iostream>

__global__ void axpy(float a, float* x, float* y) { y[threadIdx.x] = a * x[threadIdx.x];}

int main(int argc, char* argv[]) { const int kDataLen = 4;

float a = 2.0f; float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; float host_y[kDataLen];

// Copy input data to device. float *device_x, *device_y; cudaMalloc(&device_x, kDataLen * sizeof(float)); cudaMalloc(&device_y, kDataLen * sizeof(float)); cudaMemcpy(device_x, host_x, kDataLen * sizeof(float), cudaMemcpyHostToDevice);

// Launch the kernel. axpy<<<1, kDataLen>>>(a, device_x, device_y);

// Copy output data to host. cudaDeviceSynchronize(); cudaMemcpy(host_y, device_y, kDataLen * sizeof(float), cudaMemcpyDeviceToHost);

// Print the results. for (int i = 0; i < kDataLen; ++i) { std::cout << "y[" << i << "] = " << host_y[i] << "\n"; }

cudaDeviceReset(); return 0;}

#include <iostream>

__global__ void axpy(float a, float* x, float* y) { y[threadIdx.x] = a * x[threadIdx.x];}

int main(int argc, char* argv[]) { const int kDataLen = 4;

float a = 2.0f; float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; float host_y[kDataLen];

// Copy input data to device. float *device_x, *device_y; cudaMalloc(&device_x, kDataLen * sizeof(float)); cudaMalloc(&device_y, kDataLen * sizeof(float)); cudaMemcpy(device_x, host_x, kDataLen * sizeof(float), cudaMemcpyHostToDevice);

// Launch the kernel. axpy<<<1, kDataLen>>>(a, device_x, device_y);

// Copy output data to host. cudaDeviceSynchronize(); cudaMemcpy(host_y, device_y, kDataLen * sizeof(float), cudaMemcpyDeviceToHost);

// Print the results. for (int i = 0; i < kDataLen; ++i) { std::cout << "y[" << i << "] = " << host_y[i] << "\n"; }

cudaDeviceReset(); return 0;}

Page 27: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

27

CUDAFuture Directions in Open Source

● Revive gdev.

● Gallium state tracker.

● Conversion tools to/from PTX.

● Clang as a Cuda frontend.

● Open source replacement for Cuda toolchain

Page 28: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

28

Future DirectionsSummary

Common trends for GPGPU API

● Smaller, low level APIs:

● OpenCL Next

● ROCm

● CUDA Device API

● IR as driver input.

● Device language specs not part of API definition.

Page 29: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

29

Pekka Jääskeläinen

Jan Vesely

Aaron Watry

Pierre Moreau

Thank YouSlide Contributors

Page 30: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

30

References

● https://www.khronos.org/conformance/adopters/conformant-products

● https://www.khronos.org/assets/uploads/developers/library/2017-iwocl/IWOCL-Neil-Trevett-Keynote_May17.pdf

● https://github.com/shinpei0208/gdev

● https://cgit.freedesktop.org/mesa/mesa/

● https://github.com/pocl/pocl

● https://cgit.freedesktop.org/beignet/

Page 31: Current State of Open Source GPGPU - X.Org · Open Source Outlook Slow vendor adoption has been a positive: Open source implementations not that far behind. Fragmentation still an

31

References

● http://downloads.ti.com/mctools/esd/docs/opencl/index.html

● https://git.ti.com/opencl/ti-opencl

● https://git.linaro.org/gpgpu/shamrock.git

● https://github.com/RadeonOpenCompute/

● http://docs.nvidia.com/cuda/cuda-runtime-api/index.html

● https://github.com/google/clspv

● https://github.com/KhronosGroup/SPIRV-LLVM