The GPGPU Continuum

Post on 13-May-2015

description

This is a presentation I gave at our last GPGPU workshop, held in April 2013. GPGPU usage is expanding and now forms a continuum from mobile to HPC. At the same time, the question is whether the current GPGPU languages are the right ones (well, no), and whether we are wasting resources re-developing the same SW stack instead of converging.

Transcript of The GPGPU Continuum

Ofer Rosenberg

The GPU continuum workshop, April 25 2013

THE GPGPU CONTINUUM

CONTENT

• Intel’s Compute Continuum

• GPGPU Evolution

• The GPGPU Continuum

• Mobile GPGPU challenges

• GPGPU Continuum challenges

• Towards the Continuum

INTEL’S “COMPUTE CONTINUUM” FROM IDC 2010

GPGPU EVOLUTION

2004 – Stanford University: Brook for GPUs

2006 – AMD releases CTM

NVIDIA releases CUDA

2008 – OpenCL 1.0 released

G80 – 346 GFLOPS

R580 – 375 GFLOPS

GPGPU EVOLUTION

Nov 2009 - First Hybrid SC in the Top10: Chinese Tianhe-1

1,024 Intel Xeon E5450 CPUs

5,120 Radeon 4870 X2 GPUs

Nov 2010 – First Hybrid SC reaches #1 on Top500 list: Tianhe-1A

14,336 Xeon X5670 CPUs

7,168 Nvidia Tesla M2050 GPUs

Tianhe-1 : 563 TFLOPS

Tianhe-1A : 2566 TFLOPS

Source: http://www.top500.org/lists/

GPGPU EVOLUTION

2013 – OpenCL on: Nexus 4 (Qualcomm Adreno 320), Nexus 10 (ARM Mali T604)

Android 4.2 adds GPU support for Renderscript

2014 – NVIDIA Tegra 5 will support CUDA

2013 – GPGPU Continuum becomes a reality

THE GPGPU CONTINUUM

Apple A6 GPU: 25 GFLOPS, < 2 W

AMD G-T16R: 46 GFLOPS*, 4.5 W

Intel i7-3770: 511 GFLOPS*, 77 W

NVIDIA GTX Titan: 4500 GFLOPS, 250 W

ORNL TITAN SC: 27 PFLOPS, 8200 kW

* GFLOPS of CPU+GPU
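The figures on this slide can be turned into a rough GFLOPS-per-watt comparison. The sketch below only divides the quoted peak performance by the quoted power. It assumes the "< 2W" for the Apple A6 is exactly 2 W (so that entry is a lower bound), and the names `DEVICES`, `gflops_per_watt` and `efficiency_table` are illustrative, not from the talk.

```python
# Peak GFLOPS and power in watts, as quoted on the slide.
# Assumption: "< 2W" for the Apple A6 is treated as 2 W (lower bound on efficiency).
# The AMD and Intel entries are CPU+GPU totals, per the slide's footnote.
DEVICES = [
    ("Apple A6 GPU", 25.0, 2.0),
    ("AMD G-T16R", 46.0, 4.5),
    ("Intel i7-3770", 511.0, 77.0),
    ("NVIDIA GTX Titan", 4500.0, 250.0),
    ("ORNL TITAN SC", 27.0e6, 8200.0e3),  # 27 PFLOPS, 8200 kW
]


def gflops_per_watt(gflops, watts):
    """Return peak GFLOPS divided by power draw in watts."""
    return gflops / watts


def efficiency_table(devices=DEVICES):
    """Map each device name to its GFLOPS/W, rounded to two decimals."""
    return {name: round(gflops_per_watt(g, w), 2) for name, g, w in devices}


if __name__ == "__main__":
    for name, eff in efficiency_table().items():
        print(f"{name:18s} {eff:6.2f} GFLOPS/W")
```

With these inputs the GTX Titan comes out at 18 GFLOPS/W and the ORNL Titan system at about 3.3 GFLOPS/W. The comparison is loose, since it mixes peak numbers for GPU-only parts, CPU+GPU parts and a whole supercomputer, but it shows that the continuum spans about six orders of magnitude in both performance and power.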

Take Intel’s vision of the Compute Continuum, and aspire to the same for the GPGPU continuum:

A common ecosystem

built on a common (SW) architecture

INTRO TO LEADING MOBILE GPU VENDORS

Qualcomm Adreno 320

• Part of Snapdragon S4

• Unified Shader

• SIMD4 ?

• Supports OpenCL 1.1 (E)

• 50 GFLOPS

http://kyokojap.myweb.hinet.net/gpu_gflops/

Imagination PowerVR 543

• Used by Apple, Samsung, Motorola, Intel

• Unified Shaders

• Supports OpenCL 1.1 (E)

• 38 GFLOPS (Apple’s MP4 version)

ARM Mali T604

• 4 Cores

• Multiple “pipes” per core

• Supports OpenCL 1.1

• 68 GFLOPS

Vivante CG4000

• Unified Shaders

• 4 Cores, SIMD4 each

• Supports OpenCL 1.2

• 48 GFLOPS

NVIDIA Tegra 4

• 6 X 4-wide Vertex shaders

• 4 X 4-wide Pixel Shaders

• No GPGPU support

• 74 GFLOPS

MOBILE GPGPU CHALLENGES

• Many Different GPU Architectures

• Optimizing for each one sets a high bar on development costs

• Development Tools

• Immature (stability, performance)

• No common SDK / Debugger / Profiler (different per vendor)

• Ecosystem

• Lack of libraries, wizards, middleware → slow & expensive development

• Distribution Model

• Driver updates are part of OS distribution (no more per-month updates…)

• End users are less likely to update versions → higher standards for stability & performance of each driver release

• Security – the unspoken issue (hole) …

GPGPU CONTINUUM CHALLENGES

• Many Different GPU Architectures

• Optimizing for each one sets a high bar on development costs

• Development Tools

• Immature (stability, performance)

• No common SDK / Debugger / Profiler (different per vendor)

• Ecosystem

• Lack of libraries, wizards, middleware → slow & expensive development

• Distribution Model

• End users are less likely to update versions → higher standards for stability & performance of each driver release

• Security – the unspoken issue (hole) …

These challenges are a barrier to GPGPU adoption across the continuum

TOWARDS THE CONTINUUM (1) - LANGUAGES

• Welcome to the GPGPU (SW) jungle …

[Diagram: languages and APIs surrounding the GPU: OpenCL, CUDA, DirectCompute, RenderScript, OpenACC, C++ AMP, Fortran, Aparapi (Java), PyOpenCL and NumbaPro (Python), WebCL]

A jungle of languages… but are these the right ones?

TOWARDS THE CONTINUUM (1) - LANGUAGES

• Current GPGPU languages are C/C++ based

• There are bindings for Python, Java and JavaScript, but kernels are still written in C/C++

• Current developer trends:

• Managed languages (Java, C#)

• Scripting languages (Python, PHP)

• Higher abstraction & manageability:

• More room for tools to excel at optimization

• Mitigates differences between GPU architectures

Data from CodeEval.com, based on 100K+ code samples

https://sites.google.com/site/pydatalog/pypl/PyPL-PopularitY-of-Programming-Language

GPGPU languages need to evolve
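To make the point about bindings concrete, here is a minimal sketch of what a Python binding typically looks like: the host code is Python, but the kernel is still a string of OpenCL C. It uses PyOpenCL (one of the bindings named in the language jungle) only if it is installed. The kernel name `vadd` and the helpers `vadd_reference` and `vadd_opencl` are illustrative assumptions, not code from the talk.

```python
# The kernel itself is OpenCL C, embedded in the Python host program as a string.
KERNEL_SOURCE = """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""


def vadd_reference(a, b):
    """Pure-Python equivalent of the kernel, used as a CPU reference."""
    return [x + y for x, y in zip(a, b)]


def vadd_opencl(a, b):
    """Run the kernel through PyOpenCL; return None if the binding is missing.

    Assumes an OpenCL platform and device are present when PyOpenCL is installed.
    """
    try:
        import numpy as np
        import pyopencl as cl
    except ImportError:
        return None

    a_np = np.asarray(a, dtype=np.float32)
    b_np = np.asarray(b, dtype=np.float32)

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

    # The C kernel is compiled at run time by the vendor's OpenCL driver.
    program = cl.Program(ctx, KERNEL_SOURCE).build()
    program.vadd(queue, a_np.shape, None, a_buf, b_buf, out_buf)

    out = np.empty_like(a_np)
    cl.enqueue_copy(queue, out, out_buf)
    return out.tolist()
```

Everything except the kernel is managed-language code, yet the part that runs on the GPU is still written in a C dialect with explicit buffers and address spaces. That is the gap the slide points at.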

TOWARDS THE CONTINUUM (2) - SOFTWARE STACK

• Most GPGPU languages already use the LLVM compilation framework

• Slight “flavors” of LLVM IR

• Most languages also possess a similar set of “API capabilities”

• Defining a common stack based on LLVM & a common API will:

• Improve the compiler

• Increase driver quality & stability

• Enable a unified debugger / profiler

• …

[Diagram: RenderScript, OpenCL, CUDA, OpenACC → LLVM IR → Vendor X IL → Vendor X GPU]

Define a GPGPU virtual machine based on LLVM
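One way to see why a shared layer could reduce duplicated work is to count compiler paths. The sketch below is a simple counting model under stated assumptions, not a description of any vendor's actual stack. The front ends are the ones named on the slide; "Vendor Y" and "Vendor Z" are hypothetical, added only so there is more than one back end, and the function names are illustrative.

```python
# Front ends named on the slide.
FRONT_ENDS = ["OpenCL", "CUDA", "RenderScript", "OpenACC"]

# Only "Vendor X" appears on the slide; the other two are hypothetical.
BACK_ENDS = ["Vendor X", "Vendor Y (hypothetical)", "Vendor Z (hypothetical)"]


def separate_stacks(front_ends, back_ends):
    """Model with no shared layer: every language/vendor pair is its own path."""
    return [(fe, be) for fe in front_ends for be in back_ends]


def common_stack(front_ends, back_ends, ir="LLVM IR"):
    """Model with a shared IR: each front end lowers to the IR once,
    and each vendor lowers the IR to its own IL once."""
    return [(fe, ir) for fe in front_ends] + [(ir, be) for be in back_ends]


if __name__ == "__main__":
    print(len(separate_stacks(FRONT_ENDS, BACK_ENDS)), "paths without a shared IR")
    print(len(common_stack(FRONT_ENDS, BACK_ENDS)), "paths with a shared IR")
```

With four front ends and three back ends this gives 12 paths versus 7. The model assumes everyone agrees on a single IR; since the slide notes that today's stacks use slightly different "flavors" of LLVM IR, the real saving depends on converging on one.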

TAKEAWAYS

• GPGPU Continuum is here - from Mobile devices to HPC

• Vision: A common ecosystem built on a common (SW) architecture

• Challenges: many architectures, immature tools, ecosystem

QUESTIONS

• Q: What about “Heterogeneous Computing”?

• A: Go back, replace each “GPGPU” with “Heterogeneous Computing”, and it all fits…

• More?

SOME SOURCES:

• http://www.nordichardware.com/CPU-Chipset/intel-core-i7-3770k-ivy-bridge-and-the-3d-transistor-is-here/New-graphics-the-biggest-news-in-Ivy-Bridge.html

• http://elrond.informatik.tu-freiberg.de/papers/WorldComp2012/PDP2833.pdf

• http://www.anandtech.com/show/6787/nvidia-tegra-4-architecture-deep-dive-plus-tegra-4i-phoenix-hands-on/5

• http://www.anandtech.com/show/5077/arms-malit658-gpu-in-2013-up-to-10x-faster-than-mali400

• http://www.chipdesignmag.com/pallab/2011/06/30/arm-mali-gpu-unifying-graphics-across-platforms/

• http://en.wikipedia.org/wiki/Adreno#Renaming_to_Adreno

• http://en.wikipedia.org/wiki/PowerVR#Series_5_.28SGX.29

• http://en.wikipedia.org/wiki/Mali_(GPU)

• http://johndayautomotivelectronics.com/?p=12412

• http://www.cnx-software.com/2013/01/19/gpus-comparison-arm-mali-vs-vivante-gcxxx-vs-powervr-sgx-vs-nvidia-geforce-ulp/

• http://www.brightsideofnews.com/print/2013/1/30/rise-of-vivante-fastest-tablet-gpu-on-the-market.aspx

• https://www.uplinq.com/2012/schedule/accelerating-your-android-application-renderscript-and-llvm-0

• http://www.androidauthority.com/adreno-320-features-performance-benchmarks-103269/
