Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

88
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA Rafael Asenjo Dept. of Computer Architecture University of Malaga, Spain.

Transcript of Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Page 1: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Making the most out of Heterogeneous Chips with CPU,

GPU and FPGA

Rafael AsenjoDept. of Computer Architecture

University of Malaga, Spain.

Page 2: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

2

Page 3: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Xilinx Zynq UltraScale+ MPSoC

3

4 CPUsARM

Cortex-A53

GPUARM

Mali 400

2x CPUsARM

Cortex-R5

FPGA

Page 4: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Agenda

• Motivation

• Hardware: iGPUs and iFPGAs

• Software– Programming models for chips with iGPUs and iFPGAs

• Our heterogeneous templates based on TBB– parallel_for– parallel pipeline

4

Page 5: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation• A new mantra: Power and Energy saving• In all domains

5

Page 6: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• GPUs came to rescue:– Massive Data Parallel Code at

a low price in terms of power– Supercomputers and servers:

NVIDIA

• GREEN500 Top 5:– June 2017

• Top500.org (Nov 2017):– 87 systems w. NVIDIA– 12 systems w. Xeon Phi– 5 systems w. PEZY

6

Xeon + Tesla P100

Xeon + Tesla P100

Xeon + Tesla P100

Xeon + Tesla P100

Xeon + Tesla P100

Page 7: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

7

Green 500 Nov 2017

Xeon + PEZY

Xeon + PEZY

Xeon + PEZY

Xeon + PEZY

Page 8: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• There is (parallel) life beyond supercomputers:

8

Page 9: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• Plenty of GPUs elsewhere: – Integrated GPUs on more than 90% of shipped processors

9

Page 10: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• Plenty of GPUs on desktops and laptops: – Desktops (35 – 130 W) and laptops (15 – 60 W):

10

Intel Kaby Lake AMD APU Kaveri

http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/http://www.anandtech.com/show/10610/intel-announces-7th-gen-kaby-lake-14nm-plus-six-notebook-skus-desktop-coming-in-january/2

GPU

Page 11: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• Plenty of integrated GPUs in mobile devices too.

11

Huawei HiSilicon Kirin 960 (2 - 6 W)

http://www.anandtech.com/show/10766/huawei-announces-hisilicon-kirin-960-a73-g71

Huawei Mate 9

Mali

Page 12: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• And user-programmable DSPs as well:

12

Qualcomm Snapdragon 820/821 (2 - 6 W)

https://www.qualcomm.com/products/snapdragon/processors/820

Samsung Galaxy S7 Google Pixel

Page 13: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

• And FPGAs too…– ARM + iFPGA

• Altera Cyclon V• Xilinx Zynq-7000

– Intel + Altera’s FPGA• Heterogeneous

Architecture Research Platform

• CPU + iGPU + iFPGA– Xilinx Zynq UltraScale+

• 4 Cortex A53• 2 Cortex R5• GPU Mali 400• FPGA

Motivation

13 http://www.analyticsengines.com/developer-blog/xilinx-announce-new-zynq-architecture/

4 A53 GPU

2 R5

FPGA

Page 14: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• Overwhelming heterogeneity:

14

Page 15: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Motivation

• Plenty of room for improvements– Want to make the most out of the CPU, the GPU and the FPGA

– Taking productivity into account:Productivity = Performance - Pain

• High level abstractions in order to hide HW details• Lack of high-level productive programming models• “Homogeneous programming of heterogeneous architectures”

– Taking energy consumption into account:Performance ➔ Performance / WattPerformance ➔ Performance / Joule

Productivity = Performance / Joule - Pain15

Page 16: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Agenda

• Motivation

• Hardware: iGPUs and iFPGAs

• Software– Programming models for chips with iGPUs and iFPGAs

• Our heterogeneous templates based on TBB– parallel_for– parallel pipeline

16

Page 17: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Hardware: iGPUs

17

Intel Skylake AMD Kaveri

Iris Pro Graphics Graphic Core Next

HiSilicon Kirin 960

ARM Mali-G71

Page 18: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Skylake – Kaby Lake

18

• Modular design– 2 or 4 cores– GPU

• GT-1: 12 EU• GT-2: 24 EU• GT-3: 48 EU• GT-4: 72 EU

• SKU options– 2+2 (4.5W, 15W)– 4+2 (35W -- 91W)– 2+3 (15W, 28W,...)– 4+4 (65W)– ....

http://www.anandtech.com/show/6355/intels-haswell-architecture

Page 19: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Skylake – Kaby Lake

19

https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf

Page 20: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Processor Graphics

20

GT2GT3

GT4

Page 21: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Processor Graphics

21

• Peak performance:

– 72 EUs– 2 x 4-wide

SIMD – 2 flop/ck

(fmadd)– 1GHz

2 x 4-wide SIMD Units7 in-flight

wavefronts

1.15 TFlops

Page 22: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Processor Graphics

23

NDRange ≈ grid work-group ≈ block

EU-thread (SIMD16) ≈ wavefront ≈ warp:

16 work-items ≈ 16 thrds

=

• Each SIMD16 wavefront takes 2cks• IMT: fine-grained multi-threading

– Latency hiding mechanism

Page 23: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Virt. Technology for Directed I/O (Intel VT-d)

24

Cache coherent path

GPU

CPU

Page 24: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

AMD Kaveri

• Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores.

25http://wccftech.com/

CU CU CU CU

CU CU CU CU

Page 25: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

AMD Graphics Core Next (GCN)

• In Kaveri, – 8 Compute Units (CU)– Each CU: 4 x 16-wide SIMD– Total: 512 FPUs– 866 MHz

• Max GFLOPS=• 0.86 GHz x• 512 FPUs x• 2 fmad =• 880 GFLOPS

26

Page 26: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

OpenCL execution on GCN

Work-group à wavefronts (64 work-items) à pools

27

WG

CU0

SIMD0-0

SIMD0 SIMD1 SIMD2 SIMD3

SIMD1-0SIMD2-0SIMD3-0SIMD0-1SIMD1-1SIMD2-1SIMD3-1SIMD0-2SIMD1-2

4 pools:4 wavefronts in flight per SIMD

4 ck to execute each wavefront

pool number

Wavefronts

Page 27: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

HiSilicon Kirin 960

• ARM64 big.LITTLE (Cortex A73+Cortex A53) + Mali G71

28

Mali

Page 28: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Mali G71

• Supports OpenCL 2.0

• 4, 6, 8, 16, 32 Cores• Each core:

– 3 x 4-wide SIMD units

• 32 x 12 x 2 fmad x 0.85GHz=652 GFLOPS

29

Page 29: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

HSA (Heterogeneous System Architecture)

– CPU, GPU, DSPs,...• Founders:

30

• HSA Foundation’s goal: Productivity on heterogeneous HW

Page 30: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Advantages of iGPUs

• Discrete and integrated GPUs: different goals– NVIDIA Pascal: 3584 CUDA cores, 250W, 10 TFLOPS– Intel Iris Pro Graphics 580: 72EU x 8 SIMD, ~15W, 1.15 TFLOPS– Mali-G71: 32 EU x 3 x 4 SIMD, ~ 2W, 0.652 TFLOPS

• CPU and GPU are both first-class citizens. – SVM and platform atomics allows for closer collaboration

• Avoid PCI data transfer and associated POWER dissipation– Operating System doesn’t get in the way à less overhead

• CPU and GPU may have similar performance– It’s more likely that they can collaborate

• Cheaper!

31

Page 31: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Hardware: iFPGAs

32

Altera Cyclone V Xilinx Zynq UltraScale+ Intel Xeon+Arria10

QPI

FPGA

2 x A9 4 x A53

GPU

2xR5

FPGA

Page 32: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

FPGA Architecture

33

Courtesy: Javier Navaridas, U. Manchester.

Page 33: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

CLB: Configurable Logic Block

34

CIN

SwitchMatrix

COUTCOUT

Slice X0Y0

Slice X0Y1

Fast Connects

Slice X1Y0

Slice X1Y1

CIN

SHIFTIN

Left-Hand SLICEM Right-Hand SLICEL

SHIFTOUT

Courtesy: Jose Luis Nunez-Yanez, U. Bristol.

LUT

LUT

FF

FF

Page 34: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

FPGA Architecture

35

Page 35: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

How do we “mold this liquid silicon”?

• Hardware overlays– Map a CPU or GPU onto the FPGA

ü100s of simple CPU can fit on large FPGAsüThe CPU or GPU overlay can change/adapt!Resulting overlay is not as efficient as a “real” one!Same drawbacks as “general purpose” processors

36

CPU: GPU:

FPGAs for Software Programmers, Dirk Kock, Daniel Ziener

Page 36: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

How do we “mold this liquid silicon”?

üNo FETCH nor DECODE of instructions à already hardwiredüData movement (Exec. Units ⇔ MEM) reduction à Power saveüNot constrained by a fixed ISA:

üBit-level parallelism; shift, bit mask, ...; variable precision arith.!Less frequency (hundreds of MHz)!In order execution (proposal from David Kaeli’s group to OoO)!Programming effort

37

FPGA:

Hardware thread reordering to boost OpenCL throughput onFPGAs, ICCD’16

FPGAs for Software Programmers, Dirk Kock, Daniel Ziener

Page 37: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Xilinx Zynq UltraScale+ MPSoC

• Available: – SW coherence (cache flush) à Slow– HW coherence (MESI coherence protocol) à Fast. Ports:

• Accelerator Coherency Port, ACP • Cache-coherent interconnect (CCI) ACE-Lite ports

– Part of ARM® AMBA® AXI and ACE protocol specification

38

4 x A53

GPU

2xR5

FPGA

Page 38: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Broadwell+Arria10

• Cache Coherent Interconnect (CCI)– New IP inside the FPGA: Intel Quick Path Interconnect with CCI– AFU (Accelerated Function Unit) can access cached data

39

Page 39: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Towards CCI everywhere

• CCIX consortium (http://www.ccixconsortium.com)– New chip-to-chip interconnect operating at 25Gbps– Protocol built for FPGAs, GPUs, network/storage adapters, ...– Already in Xilinx Virtex UltraScale+

40

Page 40: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Which architecture suits me best?

41

• All of them are Turing complete, but• Depending on the problem:

– CPU:• For control-dominant problems• Frequent context switching

– GPU:• Data-parallel problems (vectorized/SIMD)• Little control flow and little synchronization

– FPGA:• Stream processing problems• Large data sets processed in a pipelined fashion• Low power constraints

Page 41: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Agenda

• Motivation

• Hardware: iGPUs and iFPGAs

• Software– Programming models for chips with iGPUs and iFPGAs

• Our heterogeneous templates based on TBB– parallel_for– parallel pipeline

42

Page 42: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

http://imgs.xkcd.com/comics/standards.png

43

HOW STANDARDS PROLIFERATEsee: A/C chargers, Character encodings, Instant Messaging, etc)

Page 43: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Programming models for heterogeneous: GPUs

• Targeted at exploiting one device at a time– CUDA (NVIDIA)– OpenCL (Khronos Group Standard)– OpenACC (C, C++ or Fortran + Directives à OpenMP 4.0)– C++AMP (Windows’ extension of C++. Recently HSA announced own ver.)– Sycl (Khronos Gruop’s specification for “Single-source C++ progr.)– ROCm (Radeon Open Compute, HCC, HIP, for AMD GCN –HSA–) – RenderScript (Google’s Java API for Android)– ParallDroid (Java + Directives from ULL, Spain)– Many more (Numba Python, IBM Java, Matlab, R, JavaScript, …)

• Targeted at exploiting both devices simultaneously (discrete GPUs)– Qilin (C++ and Qilin API compiled to TBB+CUDA)– OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler)– XKaapi– StarPU

• Targeted at exploiting both devices simultaneously (integrated GPUs)– Qualcomm Symphony– Intel Concord

44

Page 44: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Programming models for heterogeneous: GPUs

Targeted at exploiting ONE device at a time

45

parallel_for

CPU GPU

Targeted at exploiting SEVERAL devices at the same time

parallel_forcompiler

+runtime

CPU GPU

CUDA OpenCL

OpenMP4.0C++AMPSycl

ROCm

RenderScriptParallDroid

Numba Python

JavaScriptMatlabR

Qilin OmpSs

XKaapi StarPU

Qualcomm Symphony

Concord

Page 45: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Programming models for heterogeneous: FPGAs

• HDL-like languages– SystemVerilog, Bluespec System Verilog– Extend RTL languages by adding new features

• Parallel libraries– CUDA, OpenCL– Instrument existing parallel libraries to generate RTL code

• C-based languages– SDSoC, LegUp, ROCCC, Impulse C, Vivado, Calypto Catapult C– Compile* C into intermediate representation and from there to HDL– Functions become Finite State Machines and variables mem. blocks

• Higher level languages– Kiwi (C#), Lime (Java), Chisel (Scala)– Translate* input language into HDL

46

* Normallyasubsetthereof

Courtesy: Javier Navaridas, U. Manchester.

Page 46: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

OpenCL 2.0

• OpenCL 2.0 supports:– Coherent SVM (Shared Virtual Memory) & Platform atomics– Example: allocating an array of atomics:

47

clContex = ...clCommandQueue = ...clKernel = ...

cl_svm_mem_flags flags = CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS;atomic_int * data;data = (atomic_int *) clSVMAlloc(clContext, flags, NUM * sizeof(atomic_int), 0);

clSetKernelArgSVMPointer(clKernel, 0, data);clSetKernelArg(clKernel, 1, sizeof(int), &NUM);clEnqueueNDRangeKernel(clCommandQueue, clKernel, .... );

atomic_store_explicit ((atomic_int*)&data[0], 99, memory_order_release);

CPU Host Code

kernel void myKernel(global int *data, global int* NUM){

int i=atomic_load_explicit((global atomic_int *)&data[0], memory_order_acquire);....}

GPU Kernel Code

Page 47: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

C++AMP

• C++ Accelerated Massive Parallelism• Example: Vector Add (C[i]=A[i]+B[i])

48

A, B and C buffers

Initialize A and B

Run on GPU

Page 48: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

SYCL Parallel STL

• Parallel STL aimed at C++17 standard:– std::sort(vec.begin(), vec.end()); //Vector sort– std::sort(seq, vec.begin(), vec.end()); // Explicit SEQUENTIAL– std::sort(par, vec.begin(), vec.end()); // Explicit PARALLEL

• SYCL exposes the execution policy (user-defined)

50

std::vector<int>(N);float ratio = 0.5; sycl::het_sycl_policy<class ForEach> policy(ratio);std::for_each(policy, v.begin(), v.end(),

[=](int& val) {val= val*val;}

);

Courtesy: Antonio Vilches

50% of the iterations on CPU50% of the iterations on GPU

https://github.com/KhronosGroup/SyclParallelSTL

Page 49: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Qualcomm Symphony

• Point Kernel: write once, run everywhere– Pattern tuning extends to heterogeneous load hints

51

SYMPHONY_POINT_KERNEL_3(vadd, int, i, float*, a, int, na,float*, b, int, nb, float*, c, int, nc,

{ c[i] = a[i] + b[i];}); ...

symphony::buffer_ptr buf_a(1024), buf_b(1024), buf_c(1024); ... symphony::range<1>(1024) r; symphony::pfor_each(r, vadd_pk, buf_a, buf_b, buf_c,

symphony::pattern::tuner().set_cpu_load(10).set_gpu_load(50) .set_dsp_load(40)

);

Executed across CPU(10%) + GPU(50%) + DSP(40%)A kernel for

each device isautomatically

generated

Page 50: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Intel Concord

• C++ heterogeneous programming framework

• Papers:– Rashid Kaleem et al. Adaptive heterogeneous scheduling on

integrated GPUs. PACT 2014.• Intel Heterogeneous Research Compiler (iHRC) at

https://github.com/IntelLabs/iHRC/– Naila Farooqui et al. Affinity-aware work-stealing for integrated

CPU-GPU processors. PPoPP ’16. PhD dissertation: https://smartech.gatech.edu/bitstream/handle/1853/54915/FAROOQUI-DISSERTATION-2016.pdf

52

class Foo { int *a, *b, *c; void operator(int i) { c[i] = a[i]+b[i]; }

Foo *f = new Foo();

parallel_for(0, N, *f); Executed across CPU+GPU, dynamic partition

Page 51: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Xilinx SDSoC

• Without SDSoC:

53

FPGA

CPU

main(){

init(A,B,C);

mmult(A,B,D);

madd(C,D,E);

consume(E);

}

Courtesy: Xilix

mmult madd

ABC

D

E

Page 52: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Xilinx SDSoC

• With SDSoC:– Eclipse-based SDK

54

FPGA

CPU

Refine Code:

• Re-factoring code to be more HLS-friendly

• Adding code annotations (i.e., pragmas)

#pragma SDS data copy(Array)#pragma SDS zero_copy(Array)#pragma SDS data access_pattern#pragma SDS data sys_port (...)#pragma HLS PIPELINE II = 1#pragma HLS unroll#pragma HLS inline

Page 53: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Altera OpenCL

55

main() {read_data( ... );manipulate( ... );clEnqueueWriteBuffer( ... );clEnqueueNDRange(...,sum,...);clEnqueueReadBuffer( ... );display_result( ... );

}

__kernel voidsum(__global float *a,

__global float *b, __global float *y)

{ int i = get_global_id(0); c[i] = a[i] + b[i];

}

Host Code

Kernel Code

StandardC Compiler

OpenCLaoc compiler

Verilog

.EXE

.OUT.AOCX

bitstream

CPU FPGA

Page 54: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Altera OpenCL optimizations

• Locality:– Shift-Register Pattern (SRP)– Local Memory (LM)– Constant Memory (CM)– Load-Store Units Buffers (LSU)

• Loop Unrolling (LU)– Deeper pipelines– Optimized tree-like reductions

• Kernel vectorization (SIMD)– Wider ALUs

• Compute Unit Replication (CUR)– Replicate Pipelines– Downsides: memory divergence ⬆, complexity ⬆, frequency ⬇

56

Page 55: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Shift Register Pattern

• Example: 1D stencil computation

57

x x x

+

NDRangeNo SRP

SingleTaskSRP

x x x

+

shft_reg

64x faster than Naive (17 elem. filter)“Tuning Stencil Codes in OpenCL for

FPGAs”, ICCD’16

Shift !!

Page 56: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Agenda

• Motivation

• Hardware: iGPUs and iFPGAs

• Software– Programming models for chips with iGPUs and iFPGAs

• Our heterogeneous templates based on TBB– parallel_for– parallel pipeline

58

Page 57: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Our work: parallel_for

• Heterogeneous parallel for based on TBB– Dynamic scheduling / Adaptive partitioning for irregular codes– CPU and GPU chunk size re-computed every time

59

A15

GPU

A15

A15

A15

A7

A7

A7

Problem:BarnesHut

Archit.:Exynos 5

One A7 feedsthe GPU

Time

Time to executethis chunk of

iterations on GPU

Page 58: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

New scheduler and API

60

Page 59: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

New API

61

Use the GPU

Exploit Zero-Copy-Buffer

Scheduler class

Page 60: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

62

Page 61: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Related work: Qilin

63

• Static Approach: Bulk-Oracle Offline search (11 executions)

– Work partitioning between CPU and GPU

– One big chunk with 70% of the iterations: on the GPU– The remaining 30% processed on the cores

0"

50"

100"

150"

0%" 10%" 20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"100%"Percentage)of)the)itera.on)space)offloaded)to)the)GPU)

Barnes)Hut:)Offline)search)for)sta.c)par..on)

Execu2on"2me"in"seconds"

Page 62: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Related work: Intel Concord

64

• Papers:– Rashid Kaleem et al.

Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014.

– Naila Farooqui et al. Affinity-aware work-stealing for integrated CPU-GPU processors. PPoPP ’16(poster).

assign chunksize just for

GPU

computeon CPU

computechunk on

GPU

when the GPU is done

according to relative speeds:partition the rest of the iteration

space

Page 63: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Related Work: HDSS

65

• Belviranli et al, A Dynamic Self-Scheduling Scheme for Heterogeneous Multiprocessor Architectures, TACO 2013

TACO0904-57 ACM-TRANSACTION December 22, 2012 2:21

57:8 M. E. Belviranli et al.

Fig. 6. An example showing block assignments by time for a two-threaded (1 GPU and 1 CPU thread) run ofHistogram application on CPU+GPU system. Figure (a), on the left, illustrates the assignments dispatchedduring the adaptive phase. In this figure, block size axis is shown in logarithmic scale for a better separationof block assignments in the chart. Figure (b), on the right, shows the two phases of the algorithm separatedwith a clear jump in assigned block sizes just after computational weights are determined and stabilized.

a block size is retrieved, the programmer must call processor-specific implementations(kernels) of her application on the determined data block.

3.2. The AlgorithmOverview. The HDSS algorithm determines the ideal block size per processor usinga two-phased approach. The initial phase, called the adaptive phase, is responsiblefor accurately finding the computational weights which reflect the relative processorspeeds. This phase processes a relatively small amount of data whose upper bound isset as a fixed percentage of the total data. The final phase, called completion phase,processes the remainder of the data using the weights computed in the adaptive phase.This phase starts with the largest possible blocks to optimize the execution time byreducing the dispatching overhead. Block size decreases steadily as the computationworks towards end so that all units complete execution at nearly the same time. Asan example, the progression of block sizes during the adaptive phase is illustratedin Figure 6(a). The experiment is for Histogram application on a CPU+GPU systemwith 210M inputs. The overall process including completion phase is illustrated inFigure 6(b) for the same experiment. The sudden increase in the curve is the momentwhere HDSS assigns largest possible block for the very first block request of the secondphase.

A flow chart for our proposed self-scheduling algorithm, HDSS, is given in Figure 7.Our algorithm stores the remaining number of iterations and total computationalweight as global variables. In addition, each processing element has the followingattributes:

—the initial block size (user-defined);—computational weight in iterations per microsecond;—block size factor (user-defined);—dynamic list of blocks that have been assigned to the processor before. Each entry in

the list contains the size of the block as well as the execution start and end times ofthe block.

For the first block assignment, the initial block size parameter for the requestingprocessor is set as the block size. During subsequent block requests, HDSS decideswhether to continue with the adaptive phase or move to the next phase based ondynamics of the execution. Once HDSS concludes that the adaptive phase should beterminated, all subsequent block assignments by all processors are calculated by thecompletion phase.

ACM Transactions on Architecture and Code Optimization, Vol. 9, No. 4, Article 57, Publication date: January 2013.

FermiDiscrete

GPU

Page 64: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Debunking big chunks

• Example: irregular Barnes-Hut benchmark

66

Page 65: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Debunkign big chunks

A larger chunk causes more memory traffic

67

UpdatedchunkArray of bodies:

Writtenbody

Readbody

UpdatedchunkArray of bodies:

Writtenbody

Readbody

Page 66: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Debunking big chunks

• BarnesHut: GPU throughput along the iteration space– For two different time-steps– Different GPU chunk-sizes

68

0 2 4 6 8 10x 104

20

40

60

80

100

120

140

160

Barnes−Hut: Throughput variation (time step=0)

Iteration Space

Thro

ughp

ut

chunk=320chunk=640chunk=1280chunk=2560

0 2 4 6 8 10x 104

20

40

60

80

100

120

140

160

Barnes−Hut: Throughput variation (time step=5)

Iteration Space

Thro

ughp

ut

chunk=320chunk=640chunk=1280chunk=2560

640

2560

1280

640

2560

Time step=0 Time step=5

Page 67: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

LogFit main idea

70

Time

GPU

Core1

Core2

Core3

Core4

Time to executethischunkof

iterationsonGPU

ExpPhs

StablePhase FP

IntelCorei7Haswell

GPU

Core

Core

CoreCore

Shared

L3

Page 68: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Parallel_for: Performance per joule

• Static: Oracle-Like static partition of the work based on profiling• Concord: Intel approach: GPU size provided by user once• HDSS: Belviranli et al. approach: GPU size computed once

• LogFit: our work

72

GPU

GPU+1Core

GPU+2Cores

GPU+4Cores

GPU+3Cores

Antonio Vilches et al. Adaptive Partitioning for Irregular

Applications on Heterogeneous CPU-GPU Chips, Procedia Computer Science, 2015.

On Intel HaswellCore i7-4770

Page 69: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

LogFit: better for irregular benchmarks

73

Page 70: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

LogFit working on FPGA

• Energy probe: AccelPower Cap (Luis Piñuel, U. Complut.)

74

DE1-SoC

Altera Cyclone V

AccelPower Cap + BeagelBone

Shunt resistor(power flow)

Page 71: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

LogFit working on FPGA

75

LogFit2C+FPGA

Static

• Terasic DE1: Altera Cyclone V (FPGA+2 Cortex A9)• Energy probe: AccelPower Cap (Luis Piñuel, U. Complut.)

Page 72: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Our work: pipeline• ViVid, an object detection application• Contains three main kernels that form a pipeline

• Would like to answer the following questions:– Granularity: coarse or fine grained parallelism?– Mapping of stages: where do we run them (CPU/GPU)?– Number of cores: how many of them when running on CPU?– Optimum: what metric do we optimize (time, energy, both)?

76

Stag

e 1

Stag

e 2

Stag

e 3

Inputframe Filter Histogram Classifier

Output

response mtx.

index mtx.

histogramsdetect. response

Page 73: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Granularity

• Coarse grain:– CG

• Medium grain:– MG

• Fine grain:

– Fine grain also in the CPU via AVX intrinsics

77

Input Stage Output Stage

CPU

Citem

CPU

Citem

CPU

Stage 1

Citem

CPU

Stage 2

Citem

CPU

Stage 3

Citem

C =core

Input Stage

CPU

Citem

Output Stage

CPU

Citem

CPU

Stage 1

Citem

CC C

CPU

Stage 2

Citem

CC C

CPU

Stage 3

Citem

CC C

Input Stage Output Stage

GPUC C CC C C

itemCPU

Citem

CPU

Citem

GPUC C CC C C

itemGPUC C CC C C

item

Stage 1 Stage 2 Stage 3

Page 74: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Our work: pipeline

78

Input Stage Output Stage

GPUC C CC C C

itemCPU

Citem

CPU

Citem

CPU

Stage 2

Citem

CC C

CPU

Stage 3

Citem

CC C

Stage 1

CPU

GPU

CPU

GPU

CPU

GPU

CPU CPU CPU

GPU

CPU CPU CPU CPU

CPU CPU

GPU

CPU

GPU

CPU CPU CPU

GPU

CPU CPU

GPU

CPU CPU

CPU CPU CPU CPU CPU

CPU CPU

GPU

CPU

GPU

CPU CPU

GPU

CPU CPU CPU CPU CPU

CPU CPU CPU CPU CPU

GPU

111 100

010 001

110 101

011 000

Page 75: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Accounting for all alternatives

• In general: nC CPU cores, 1 GPU and p pipeline stages

# alternatives = 2p x (nC +2)

• For Rodinia’s SRAD benchmark (p=6,nC=4) à 384 alternatives

79

Input Stage

GPUC C CC C C

item

CPU

Citem

Output Stage

CPU

Citem

GPUC C CC C C

itemGPUC C CC C C

item

CPU

Stage 1

itemCPU

Stage 2

itemCPU

Stage p

itemC C

nC coresC C

nC coresC C

nC cores

Mapping streaming applications on commodity multi-CPU and GPU on-chip processors, IEEE Tran. Parallel and Distributed Systems, 2016.

Page 76: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Framework and Model

• Key idea: 1. Run only on GPU2. Run only on CPU3. Analytically extrapolate for heterogeneous execution4. Find out the best configuration à RUN

80

Input Stage

GPUC C CC C C

item

CPU

Citem

Output Stage

CPU

Citem

GPUC C CC C C

itemGPUC C CC C C

item

CPU

Stage 1

Citem

CC C

CPU

Stage 2

Citem

CC C

CPU

Stage 3

Citem

CC C

DP-MG

collect λ and E (homogeneous values)

Page 77: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Environmental Setup: Benchmarks

• Four Benchmarks– ViVid (Low and High Definition inputs)

– SRAD

– Tracking

– Scene Recognition

81

InptFilter Histogram Classifier

St 1 St 2 St 3 Out

InptExtrac. Prep. Reduct.

St 1 St 2 St 3 St 4Comp.1 Comp. 2

St 5 St 6 OutStatist.

InptTrack.

St 1 Out

InptFeature. SVM

St 1 St 2 Out

Page 78: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Does Higher Throughput imply Lower Energy?

82

1.0E-03 2.0E-03 3.0E-03 4.0E-03 5.0E-03 6.0E-03 7.0E-03 8.0E-03 9.0E-03

Num-Threads(CG)

6.0

8.0

10.0

12.0

14.0

16.0

Num-Threads(CG)

0.0E+00

2.0E-04

4.0E-04

6.0E-04

8.0E-04

1.0E-03

Num-Threads(CG)

HaswellHD

Page 79: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

1.0E-01

3.0E-01

5.0E-01

7.0E-01

9.0E-01

1.1E+00

Num-Threads(CG)

SRAD on Haswell

84

0.0E+00

1.0E-01

2.0E-01

3.0E-01

4.0E-01

5.0E-01

6.0E-01

7.0E-01

8.0E-01

Num-Threads(CG)

0.0E+00 2.0E-02 4.0E-02 6.0E-02 8.0E-02 1.0E-01 1.2E-01 1.4E-01 1.6E-01 1.8E-01

Num-Threads(CG)

Input Stage

GPUC C C

C C C

item

CPU

Citem

CPU

Stage 1

Citem

GPUC C C

C C C

item

CPU

Stage 2

Citem

Output Stage

CPU

Citem

GPUC C C

C C C

item

CPU

Stage 6

Citem

DP-CGGPUC C C

C C C

item

CPU

Stage 3

Citem

GPUC C C

C C C

item

CPU

Stage 4

Citem

GPUC C C

C C C

item

CPU

Stage 5

Citem

Page 80: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Results on big.LITTLE architecture

85

• On Odroid XU3: 4 Cortex A15 + 4 Cortex A7 + GPU Mali

Input Stage Output Stage

CPU

Citem

CPU

Citem

CPU

Stage 1

Citem

CPU

Stage 2

Citem

CPU

Stage 3

Citem

C =core

5 thr (GPU+4A15)

1thr (A15)

2thr (2 A15)4thr (4 A15)

5thr (4 A15+1 A7)

9thr (4 A15+ 4 A7)

1 thr (GPU) 9 thr (GPU+4A15+4A7)

Page 81: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Pipeline on CPU-FPGAs

• Change the GPU by a FPGA– Core i7 4770k 3.5 GHz Haswell quad-core CPU– Altera DE5-Net FPGA from Terasic– Intel HD6000 embedded GPU – Nvidia Quadro K600 discrete GPU (192 cuda cores)

• Homogeneous results:

86

Device Time (sec.)

Throughput (fps)

Power (watts)

Energy (Jules)

CPU (1 core) 167.610 0.596& 37& 6201&CPU (4 cores) 50.310& 1.987& 92& 4628&

FPGA 13.480& 7.418& 5" 67"On-chip GPU 7.855" 12.730" 16& 125&Discrete GPU 13.941& 7.173& 47& 655&

Page 82: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Pipeline on CPU-FPGA

• Heterogeneous results based on our TBB scheduler:- Accelerator + CPU- Mapping 001

87

onGPU+CPU ~ FPGA+CPU

Page 83: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Conclusions• Plenty of heterogeneous on-chip architectures out there

• Why not use all devices?– Need to find the best mapping/distribution/scheduling out of the

many possible alternatives.

• Programming models and runtimes aimed at this goal are in their infancy: striving to solve this.

• Challenges: – Hide HW complexity providing a productive programming model– Consider energy in the partition/scheduling decisions– Minimize overhead of adaptation policies

88

Page 84: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Future Work

• ExploitCPU-GPU-FPGAchipsandcoherency• PredictenergyconsumptiononheterogeneousCPU-

GPU-FPGAchips

• Considerenergyconsumptioninthepartitioningheuristic

• Rebuildtheschedulerengineontopofthenew TBBlibrary(OpenCLnode)

• Tackleotherirregularcodes

89

Page 85: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Future Work: TBB OpenCL node

buffer

get_next_image preprocess

detect_with_A

detect_with_B

make_decision

Can express pipelining, task parallelism and data parallelism

OpenCL node

Page 86: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Future Work: other irregular codes

91

1

2 3 4

11 12 13 14

5 6 7 8 9 10

15 16 17

1918

Bottom-Up

a) Selective

1

2 3 4

11 12 13 14

5 6 7 8 9 10

15 16 17

1918

b) Concurrent

Top-Dow

n

1

2 3 4

11 12 13 14

5 6 7 8 9 10

15 16 17

1918

c) Asynchronous

5

1 CPU

GPU

19 Both

Node computed on:

CPU

GPU

Both

Frontier computed on:

C.H.G Vázquez, PhD, Library based solutions for algorithms with complex patterns, 2015

Page 87: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Collaborators

• Mª Ángeles González Navarro (UMA, Spain)• Andrés Rodríguez (UMA, Spain)• Francisco Corbera (UMA, Spain)• Jose Luis Nunez-Yanez (University of Bristol)• Antonio Vilches (UMA, Spain)• Alejandro Villegas (UMA, Spain)• Juan Gómez Luna (U. Cordoba, Spain)• Rubén Gran (U. Zaragoza, Spain)• Darío Suárez (U. Zaragoza, Spain)• Maria Jesús Garzarán (Intel and UIUC, USA)• Mert Dikmen (UIUC, USA)• Kurt Fellows (UIUC, USA)• Ehsan Totoni (UIUC, USA)• Luis Remis (Intel, USA)

92

Page 88: Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Questions

[email protected]

93