Online Performance Projection for Clusters with Heterogeneous GPUs
Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)
Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)
Diversity in Accelerators
[Figure: performance share of accelerators in Top500 systems, Nov 2008 vs. Nov 2013. Source: top500.org]
Heterogeneity “Among” Nodes
• Clusters are deploying different accelerators
  – Different accelerators for different tasks
• Example clusters:
  – “Shadowfax” at VBI@VT: NVIDIA GPUs, FPGAs
  – “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
  – “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs
• However… a unified programming model covers “all” accelerators: OpenCL
  – CPUs, GPUs, FPGAs, DSPs
Affinity of Tasks to Processors
• Peak performance doesn’t necessarily translate into actual device performance.
Reduction kernel   Peak GFLOPs   Global Memory BW (GB/s)   Actual Time (ms)
NVIDIA C2050       1030          144                       0.13
AMD HD5870         2720          154                       0.21
int main() {
  cl_int error;
  cl_platform_id platform;
  cl_device_id device;
  cl_uint platforms, devices;
  // Fetch the platform and device IDs
  error = clGetPlatformIDs(1, &platform, &platforms);
  error = clGetDeviceIDs(platform, .., &devices);
  cq = clCreateCommandQueue(context, .., &error);
  prog = clCreateProgramWithSource(context, 1, srcptr, &srcsize, &error);
  error = clBuildProgram(prog, 0, NULL, "", NULL, NULL);
  // Perform the operation
  error = clEnqueueNDRangeKernel(cq, .., NULL);
  // Read the result
  error = clEnqueueReadBuffer(cq, .., NULL);
  // Await completion of all the above
  error = clFinish(cq);
}
Given an OpenCL program like the one above, which of the devices in the table should run it?
Challenges for Runtime Systems
• Heterogeneous runtime systems must embrace the different accelerators in a cluster with respect to both performance and power
• Examples of OpenCL runtime systems:
  – SnuCL
  – VOCL
  – SOCL
• Challenges:
  – Efficiently choose the right device for the right task
  – Keep the decision-making overhead minimal
Our Contributions
• An online workload characterization technique for OpenCL kernels
• Our model projects the relative ranking of different devices with little overhead
• An end-to-end evaluation of our technique for multiple architectural families of AMD and NVIDIA GPUs
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Design
• Goal: rank accelerators for a given OpenCL workload
  – Accurately AND efficiently: decision making with minimal overhead
• Choices:
  – Static code analysis:
    • Fast
    • Inaccurate, as it does not account for dynamic properties: input data dependence, memory access patterns, dynamic instruction counts
  – Dynamic code analysis:
    • Higher accuracy
    • Execute either on the actual device or through an emulator
      – Not always feasible to run on actual devices: data transfer costs, and clusters are “busy”
      – Emulators are very slow
Design – Workload Profiling
[Diagram: an OpenCL kernel is run through an emulator to obtain memory access patterns, bank conflicts, and the instruction mix]
Design – Workload Profiling
• “Mini-emulation”: emulate a single workgroup
• Collect dynamic characteristics:
  – Instruction traces
  – Global and local memory transactions and access patterns
• In typical data-parallel workloads, workgroups exhibit similar runtime characteristics
  – Asymptotically lower overhead than emulating the full kernel (see the sketch after the diagram below)
[Diagram: the OpenCL kernel is run through the mini-emulator to obtain memory access patterns, bank conflicts, and the instruction mix]
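To make the overhead argument concrete, here is a minimal C sketch of the idea. It assumes a functional emulator exposes a per-work-item replay hook; emulate_work_item() and the trace fields below are hypothetical stand-ins, not the actual Multi2Sim or GPGPU-Sim API.

/* Sketch (not the paper's implementation): characterize an OpenCL kernel
 * by emulating ONE workgroup instead of all of them. */
#include <stddef.h>

typedef struct {
    long alu_insns;        /* dynamic ALU/compute instructions       */
    long global_bytes;     /* global memory bytes (after coalescing) */
    long local_bytes;      /* local memory bytes                     */
    long bank_conflicts;   /* local memory bank conflicts            */
} trace_t;

/* Hypothetical emulator hook: replay one work-item of workgroup wg. */
trace_t emulate_work_item(const char *kernel_name, size_t wg, size_t tid);

typedef struct {
    long compute_insns, gmem_bytes, lmem_bytes, bank_conflicts;
} workload_profile;

workload_profile profile_one_workgroup(const char *kernel_name, size_t local_size)
{
    workload_profile p = {0, 0, 0, 0};
    /* Only workgroup 0 is emulated; in data-parallel kernels the remaining
     * workgroups behave almost identically, so the cost of profiling is
     * independent of the total problem size. */
    for (size_t tid = 0; tid < local_size; tid++) {
        trace_t t = emulate_work_item(kernel_name, 0, tid);
        p.compute_insns  += t.alu_insns;
        p.gmem_bytes     += t.global_bytes;
        p.lmem_bytes     += t.local_bytes;
        p.bank_conflicts += t.bank_conflicts;
    }
    return p;
}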
Design – Device Profiling
[Diagram: instruction and memory microbenchmarks are run on each GPU (GPU 1 … GPU N) to build device throughput profiles; each profile plots throughput (GB/s, log scale) against occupancy (1/32 to 1), e.g. for global memory reads]
Design – Device Profiling
• Build device throughput profiles:
  – Modified the SHOC microbenchmarks to:
    • Obtain hardware throughput with varying occupancy
    • Collect throughputs for instructions, global memory, and local memory
  – Built only once per device (a sketch of such a microbenchmark follows the figure below)
Global and local memory profile of the AMD 7970:
[Figure: global and local memory read/write throughput (GB/s, log scale) vs. occupancy (1/32 to 1)]
Design – Find Performance Limiter
[Diagram: the workload profile (instruction mix, memory access patterns, bank conflicts) is combined with the device throughput profile to project a time for compute, global memory, and local memory; the largest projected time is the performance bound]
Design – Find Performance Limiter
• Single-workgroup dynamic characteristics → full-kernel characteristics
  – Device occupancy is used as the scaling factor
• Compute projected theoretical times for:
  • Instructions
  • Global memory
  • Local memory
• GPUs aggressively try to hide the latencies of these components
• Performance limiter = max(t_local, t_global, t_compute)*
• Compare the normalized projected times across devices and choose the best one (a sketch of this step follows)
[Chart: projected times for compute, global memory, and local memory; the maximum of the three is the performance bound]
*Zhang et al., “A Quantitative Performance Analysis Model for GPU Architectures,” HPCA 2011
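A minimal C sketch of this projection step, assuming a workload profile already scaled to the full kernel and per-device throughputs looked up from the device profiles at the kernel's occupancy (the struct fields and function names are illustrative, not the paper's implementation):

#include <stddef.h>

typedef struct {                 /* from mini-emulation, scaled by the number of workgroups */
    double compute_insns, gmem_bytes, lmem_bytes;
} kernel_demand;

typedef struct {                 /* from the device profile, at the kernel's occupancy */
    double insn_throughput;      /* instructions per second */
    double gmem_bw, lmem_bw;     /* bytes per second        */
} device_rates;

/* Projected time is bounded by whichever component dominates,
 * following Zhang et al. (HPCA 2011): max(t_compute, t_gmem, t_lmem).
 * The values are relative estimates used only for ranking devices. */
static double projected_time(kernel_demand d, device_rates r)
{
    double t_compute = d.compute_insns / r.insn_throughput;
    double t_gmem    = d.gmem_bytes    / r.gmem_bw;
    double t_lmem    = d.lmem_bytes    / r.lmem_bw;
    double t = t_compute;
    if (t_gmem > t) t = t_gmem;
    if (t_lmem > t) t = t_lmem;
    return t;
}

/* Rank devices: the one with the smallest projected time wins. */
static int best_device(kernel_demand d, const device_rates *dev, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (projected_time(d, dev[i]) < projected_time(d, dev[best]))
            best = i;
    return best;
}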
Design
[Diagram, static profiling: instruction and memory microbenchmarks are run once on each GPU (GPU 1 … GPU N) to build device throughput profiles (throughput in GB/s vs. occupancy for global and local memory reads and writes)]
Design
[Diagram, dynamic profiling added: the GPU kernel is run through the mini-emulator (a single workgroup) to obtain memory access patterns, bank conflicts, and the instruction mix, alongside the statically built device profiles]
Design
[Diagram, complete workflow: the workload profile from the mini-emulator and the device throughput profiles feed a performance projection that identifies each device's performance limiter, yielding effective instruction throughput and effective global and local memory bandwidths and, from these, the relative performances of GPU 1 … GPU N]
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Experimental Setup
• Accelerators:
  – AMD 7970: scalar ALUs, cache hierarchy
  – AMD 5870: VLIW ALUs
  – NVIDIA C2050: Fermi architecture, cache hierarchy
  – NVIDIA C1060: Tesla architecture
• Simulators:
  – Multi2Sim v4.1 for AMD devices and GPGPU-Sim v3.0 for NVIDIA devices
  – The methodology is agnostic to the specific emulator
• Applications:
Application          Input size
FloydWarshall        Num nodes = 192
FastWalshTransform   Array size = 1048576
MatrixMul (global)   Matrix size = [1024, 1024]
MatrixMul (local)    Matrix size = [1024, 1024]
Reduction            Array size = 1048576
NBody                Num particles = 32768
AESEncryptDecrypt    Width = 1536, Height = 512
MatrixTranspose      Matrix size = [1024, 1024]
Application Boundedness: AMD GPUs
[Figure: projected time (normalized, log scale) on the HD 5870 and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (gmem only), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (lmem), and MatrixTranspose, broken down into compute, global memory (gmem), and local memory (lmem) bounds, with each application's performance limiter labeled]
Application Boundedness Summary
Application          AMD 5870   AMD 7970   NVIDIA C1060   NVIDIA C2050
FloydWarshall        gmem       gmem       gmem           gmem
FastWalshTransform   gmem       gmem       gmem           gmem
MatrixTranspose      gmem       gmem       gmem           gmem
MatMul (global)      gmem       gmem       gmem           gmem
MatMul (local)       local      local      gmem           compute
Reduction            gmem       gmem       gmem           compute
NBody                compute    compute    compute        compute
AESEncryptDecrypt    local      compute    compute        compute
Accuracy of Performance Projection
[Figure: actual vs. projected execution times (normalized, log scale) on the C1060, C2050, HD 5870, and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (Lmem), and MatrixTranspose]
Best device per application (actual vs. projected):
Application          Actual   Projected
Fast Walsh           7970     7970
Floyd Warshall       7970     7970
MatMul (global)      5870     5870
NBody                7970     7970
AESEncryptDecrypt    2050     7970
Reduction            7970     7970
MatMul (local)       7970     7970
MatrixTranspose      2050     2050
Emulation Overhead – Reduction Kernel
[Figure: kernel emulation time (s, log scale) vs. data size (65536 to 1048576) for full-kernel emulation and single-workgroup emulation of the Reduction kernel on the C2050 and HD 7970]
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
90/10 Paradigm -> 10x10 Paradigm
• Simple, specialized tools (“accelerators”) customized for different purposes (“applications”)
  – Narrower focus on applications (10%)
  – Simplified and specialized accelerators for each classification
• Why?
  – 10x lower power, 10x faster -> 100x more energy efficient
Figure credit: A. Chien, Salishan Conference 2010
Conclusion
• We presented a “mini-emulation” technique for online workload characterization of OpenCL kernels
  – The approach is shown to be sufficiently accurate for relative performance projection
  – The approach has asymptotically lower overhead than projection using full-kernel emulation
• Our technique is shown to work well across multiple architectural families of AMD and NVIDIA GPUs
• With the increasing diversity in accelerators (towards 10x10*), our methodology only becomes more relevant
*S. Borkar and A. Chien, “The future of microprocessors,” Communications of the ACM, 2011
Evolution of Microprocessors: 90/10 Paradigm
• Derive common cases for applications (90%)
  – Broad focus on application workloads
• Architectural improvements for 90% of cases
  – Design an aggregated, generic “core”
  – Less customizability for applications
Figure credit: A. Chien, Salishan Conference 2010
Application Boundedness: NVIDIA GPUs
[Figure: projected time (normalized, log scale) on the C1060 and C2050 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (gmem only), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (lmem), and MatrixTranspose, broken down into compute, global memory (gmem), and local memory (lmem) bounds, with each application's performance limiter labeled]
Evaluation: Projection Accuracy (Relative to C1060)
[Figure: actual and projected execution times relative to the C1060 (log scale), and relative error (%), on the C1060, C2050, HD 5870, and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (Lmem), and MatrixTranspose]
Evaluation: Projection Overhead vs. Actual Kernel Execution of Matrix Multiplication
[Figure: kernel execution time (s, log scale) vs. matrix size (64 to 8192, x = y = z) for actual device execution on the C2050 and single-workgroup emulation on the C2050]
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Matrix Multiplication
[Figure: kernel emulation time (s, log scale) vs. matrix size (64 to 1024, x = y = z) for full-kernel emulation and single-workgroup emulation on the C2050 and HD 7970]
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Reduction
[Figure: kernel emulation time (s, log scale) vs. data size (65536 to 1048576) for full-kernel emulation and single-workgroup emulation of the Reduction kernel on the C2050 and HD 7970]