Online Performance Projection for Clusters with Heterogeneous GPUs

synergy.cs.vt.edu
Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)
Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)

Transcript of Online Performance Projection for Clusters with Heterogeneous GPUs

Page 1: Online Performance Projection for Clusters with Heterogeneous GPUs

synergy.cs.vt.edu

 Online Performance Projection for Clusters with Heterogeneous GPUs 

Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)

Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)

Page 2: Online Performance Projection for Clusters with Heterogeneous GPUs

synergy.cs.vt.edu, Lokendra Panwar ([email protected])

Diversity in Accelerators

[Figure: Performance share of accelerators in Top500 systems, Nov 2008 vs. Nov 2013. Source: top500.org]

Page 4: Online Performance Projection for Clusters with Heterogeneous GPUs

Heterogeneity “Among” Nodes

• Clusters are deploying different accelerators
  – Different accelerators for different tasks
• Example clusters:
  – “Shadowfax” at VBI@VT: NVIDIA GPUs, FPGAs
  – “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
  – “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs
• However, there is a unified programming model for “all” accelerators: OpenCL
  – CPUs, GPUs, FPGAs, DSPs

Page 6: Online Performance Projection for Clusters with Heterogeneous GPUs

Affinity of Tasks to Processors

• Peak performance doesn’t necessarily translate into actual device performance.

OpenCL program (“..” marks arguments elided on the slide):

    int main() {
        cl_int error;
        cl_platform_id platform;
        cl_device_id device;
        cl_uint platforms, devices;

        // Fetch the platform and device IDs
        error = clGetPlatformIDs(1, &platform, &platforms);
        error = clGetDeviceIDs(platform, .., &devices);
        cq = clCreateCommandQueue(context, .., &error);
        prog = clCreateProgramWithSource(context, 1, srcptr, &srcsize, &error);
        error = clBuildProgram(prog, 0, NULL, "", NULL, NULL);

        // Perform the operation
        error = clEnqueueNDRangeKernel(cq, .., NULL);

        // Read the result
        error = clEnqueueReadBuffer(cq, .., NULL);

        // Await completion of all the above
        error = clFinish(cq);
    }

Which device should run it? For the Reduction kernel:

Device          Peak GFLOPS   Global memory BW (GB/s)   Actual time (ms)
NVIDIA C2050    1030          144                       0.13
AMD HD5870      2720          154                       0.21

Page 7: Online Performance Projection for Clusters with Heterogeneous GPUs

Challenges for Runtime Systems

• Heterogeneous runtime systems must account for the different accelerators in a cluster with respect to performance and power
• Examples of OpenCL runtime systems:
  – SnuCL
  – VOCL
  – SOCL
• Challenges:
  – Efficiently choose the right device for the right task
  – Keep the decision-making overhead minimal

Page 8: Online Performance Projection for Clusters with Heterogeneous GPUs

Our Contributions

• An online workload characterization technique for OpenCL kernels
• Our model projects the relative ranking of different devices with little overhead
• An end-to-end evaluation of our technique for multiple architectural families of AMD and NVIDIA GPUs

Page 9: Online Performance Projection for Clusters with Heterogeneous GPUs

Outline

• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion

Page 12: Online Performance Projection for Clusters with Heterogeneous GPUs

Design

• Goal:
  – Rank accelerators for a given OpenCL workload
• Accurately AND efficiently
  – Decision making with minimal overhead
• Choices:
  – Static code analysis:
    • Fast
    • Inaccurate, as it does not account for dynamic properties:
      – Input data dependence, memory access patterns, dynamic instructions
  – Dynamic code analysis:
    • Higher accuracy
    • Execute either on the actual device or through an “emulator”
      – Not always feasible to run on actual devices: data transfer costs, clusters are “busy”
      – Emulators are very slow
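The weakness of static analysis listed above can be illustrated with a small sketch in plain C. The function below is a hypothetical stand-in for a kernel body (the name `dynamic_iterations` and the sentinel-scan logic are assumptions for illustration, not code from the talk). Its source text never changes, yet the number of loop iterations it executes depends entirely on the input data, which is the kind of property only a dynamic analysis can observe.

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel body: scan the input until a
 * sentinel value is found and report how many iterations actually ran.
 * The iteration count is a dynamic property of the input data. */
size_t dynamic_iterations(const int *data, size_t n, int sentinel)
{
    size_t iterations = 0;
    for (size_t i = 0; i < n; ++i) {
        ++iterations;
        if (data[i] == sentinel) {
            break;  /* early exit depends on the data, not on the code */
        }
    }
    return iterations;
}
```

Calling the function on the same array with different sentinels yields different iteration counts, even though a static tool would see identical code in both cases.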

Page 13: Online Performance Projection for Clusters with Heterogeneous GPUs

Design – Workload Profiling

[Diagram: OpenCL kernel → emulator → memory patterns, bank conflicts, instruction mix]

Page 14: Online Performance Projection for Clusters with Heterogeneous GPUs

Design – Workload Profiling

• “Mini-emulation”
  – Emulate a single workgroup
• Collect dynamic characteristics:
  – Instruction traces
  – Global and local memory transactions and access patterns
• In typical data-parallel workloads, workgroups exhibit similar runtime characteristics
  – Asymptotically lower overhead

[Diagram: OpenCL kernel → mini-emulator → memory patterns, bank conflicts, instruction mix]
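A minimal sketch of why emulating one workgroup is cheaper, under stated assumptions. The struct `wg_counters` and both function names are hypothetical, and the plain multiplication by the workgroup count is an illustrative simplification (the talk itself scales by device occupancy). The point is only that full emulation has to cover every workgroup, while mini-emulation covers one and extrapolates.

```c
#include <stddef.h>

/* Hypothetical counters a mini-emulator might report for ONE workgroup. */
typedef struct {
    double instructions;  /* dynamic instructions executed */
    double global_bytes;  /* bytes moved to or from global memory */
    double local_bytes;   /* bytes moved to or from local memory */
} wg_counters;

/* Number of workgroups in a 1-D NDRange, rounding the global size up. */
size_t num_workgroups(size_t global_size, size_t local_size)
{
    return (global_size + local_size - 1) / local_size;
}

/* Assumption: every workgroup behaves like the emulated one, so the
 * whole-kernel totals are the single-workgroup counters multiplied by
 * the number of workgroups. */
wg_counters extrapolate(wg_counters one, size_t workgroups)
{
    wg_counters all;
    all.instructions = one.instructions * (double)workgroups;
    all.global_bytes = one.global_bytes * (double)workgroups;
    all.local_bytes  = one.local_bytes  * (double)workgroups;
    return all;
}
```

With 1048576 work-items and an assumed workgroup size of 256, `num_workgroups` returns 4096, so a full emulation would cover 4096 workgroups where mini-emulation covers one.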

Page 15: Online Performance Projection for Clusters with Heterogeneous GPUs

Design – Device Profiling

[Diagram: instruction and memory microbenchmarks are run on GPU 1, GPU 2, …, GPU N to produce device throughput profiles; each profile plots global memory read throughput (GB/s, log scale from 10 to 10000) against occupancy (1/32 to 1)]

Page 16: Online Performance Projection for Clusters with Heterogeneous GPUs

Design – Device Profiling

• Build device throughput profiles:
  – Modified SHOC microbenchmarks to
    • Obtain hardware throughput with varying occupancy
    • Collect throughputs for instructions, global memory and local memory
  – Built only once

[Figure: Global and local memory profile of AMD 7970. Throughput (GB/s, log scale from 10 to 10000) against occupancy (1/32 to 1) for global memory read, global memory write, local memory read and local memory write]
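One way a runtime might consult such a profile is sketched below. The layout of `throughput_profile`, the function name `throughput_at`, and the choice of linear interpolation between sampled occupancies are all assumptions made for illustration; the slides only state that throughput is measured at varying occupancy and that the profile is built once per device.

```c
#include <stddef.h>

#define PROFILE_POINTS 6  /* occupancies 1/32, 1/16, 1/8, 1/4, 1/2, 1 */

/* Hypothetical device profile: throughput sampled at ascending occupancy. */
typedef struct {
    double occupancy[PROFILE_POINTS];
    double throughput[PROFILE_POINTS];  /* GB/s */
} throughput_profile;

/* Look up the throughput at a given occupancy. Values between two
 * samples are linearly interpolated; values outside the sampled range
 * are clamped to the nearest sample. */
double throughput_at(const throughput_profile *p, double occ)
{
    if (occ <= p->occupancy[0]) {
        return p->throughput[0];
    }
    for (size_t i = 1; i < PROFILE_POINTS; ++i) {
        if (occ <= p->occupancy[i]) {
            double x0 = p->occupancy[i - 1];
            double x1 = p->occupancy[i];
            double t = (occ - x0) / (x1 - x0);
            return p->throughput[i - 1]
                 + t * (p->throughput[i] - p->throughput[i - 1]);
        }
    }
    return p->throughput[PROFILE_POINTS - 1];
}
```

Because the profile is built only once per device, a lookup like this adds very little work at decision time.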

Page 17: Online Performance Projection for Clusters with Heterogeneous GPUs

Design – Find Performance Limiter

[Diagram: the device profile (throughput in GB/s against occupancy for global memory read, global memory write and local memory read) is combined with the workload profile (memory patterns, bank conflicts, instruction mix) to give a bar chart of projected time for compute, global memory and local memory, with the performance bound marked]

Page 18: Online Performance Projection for Clusters with Heterogeneous GPUs

Design – Find Performance Limiter

• Single-workgroup dynamic characteristics → full kernel characteristics
  – Device occupancy as scaling factor
• Compute projected theoretical times:
  – Instructions
  – Global memory
  – Local memory
• GPUs aggressively try to hide latencies of components
• Performance limiter = max(t_local, t_global, t_compute)*
• Compare the normalized predicted times and choose the best device

[Figure: bar chart of projected time for compute, global memory and local memory, with the performance bound marked]

*Zhang et al., “A Quantitative Performance Analysis Model for GPU Architectures,” HPCA 2011
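The limiter rule above can be sketched in a few lines of C. All type and function names here (`workload`, `device_rates`, `project`, `best_device`) are hypothetical, and each component time is assumed to be simply the amount of work divided by the device's effective throughput at the kernel's occupancy. This is an illustration of the max() rule, not the authors' implementation.

```c
#include <stddef.h>

/* Whole-kernel workload characteristics (hypothetical layout). */
typedef struct {
    double instructions;  /* dynamic instruction count */
    double global_bytes;  /* global memory traffic */
    double local_bytes;   /* local memory traffic */
} workload;

/* Effective device throughputs at the kernel's occupancy (hypothetical). */
typedef struct {
    double inst_per_s;
    double global_bytes_per_s;
    double local_bytes_per_s;
} device_rates;

typedef enum { LIMIT_COMPUTE, LIMIT_GLOBAL, LIMIT_LOCAL } limiter;

typedef struct {
    double time_s;  /* projected time of the limiting component */
    limiter bound;  /* which component limits performance */
} projection;

/* Performance limiter = max(t_compute, t_global, t_local). */
projection project(workload w, device_rates d)
{
    double t_compute = w.instructions / d.inst_per_s;
    double t_global  = w.global_bytes / d.global_bytes_per_s;
    double t_local   = w.local_bytes  / d.local_bytes_per_s;

    projection p = { t_compute, LIMIT_COMPUTE };
    if (t_global > p.time_s) { p.time_s = t_global; p.bound = LIMIT_GLOBAL; }
    if (t_local  > p.time_s) { p.time_s = t_local;  p.bound = LIMIT_LOCAL; }
    return p;
}

/* Index of the device with the smallest projected time (n must be >= 1). */
size_t best_device(workload w, const device_rates *devs, size_t n)
{
    size_t best = 0;
    double best_time = project(w, devs[0]).time_s;
    for (size_t i = 1; i < n; ++i) {
        double t = project(w, devs[i]).time_s;
        if (t < best_time) { best_time = t; best = i; }
    }
    return best;
}
```

Taking the maximum rather than the sum reflects the slide's point that GPUs overlap the latencies of the three components, so the slowest one dominates.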

Page 21: Online Performance Projection for Clusters with Heterogeneous GPUs

Design

[Diagram: overall workflow]
• Static profiling: instruction and memory benchmarks run on GPU 1, GPU 2, …, GPU N produce the device profile (throughput in GB/s against occupancy, 1/32 to 1, for global memory read, global memory write, local memory read and local memory write)
• Dynamic profiling: the GPU kernel is run through the mini-emulator (single workgroup) to obtain memory patterns, bank conflicts and instruction mix
• Performance projection: effective instruction throughput, effective global memory bandwidth and effective local memory bandwidth determine the performance limiter, giving the relative GPU performances (GPU 1 to GPU 4)

Page 23: Online Performance Projection for Clusters with Heterogeneous GPUs

Experimental Setup

• Accelerators:
  – AMD 7970: Scalar ALUs, cache hierarchy
  – AMD 5870: VLIW ALUs
  – NVIDIA C2050: Fermi architecture, cache hierarchy
  – NVIDIA C1060: Tesla architecture
• Simulators:
  – Multi2Sim v4.1 for AMD and GPGPU-Sim v3.0 for NVIDIA devices
  – Methodology agnostic to specific emulator
• Applications:

Application           Input size
FloydWarshall         Num Nodes = 192
FastWalshTransform    Array Size = 1048576
MatrixMul(global)     Matrix Size = [1024,1024]
MatrixMul(local)      Matrix Size = [1024,1024]
Reduction             Array Size = 1048576
NBody                 NumParticles = 32768
AESEncryptDecrypt     Width = 1536, Height = 512
MatrixTranspose       Matrix Size = [1024,1024]

Page 24: Online Performance Projection for Clusters with Heterogeneous GPUs

Application Boundedness: AMD GPUs

[Figure: projected time (normalized, log scale) for compute, global memory (Gmem) and local memory (Lmem) on the HD 5870 and HD 7970, for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem only), Nbody, AESEncryptDecrypt, Reduction, MatrixMultiply (Lmem) and MatrixTranspose; each application is labeled with its limiting component (gmem, lmem or compute)]

Page 25: Online Performance Projection for Clusters with Heterogeneous GPUs

Application Boundedness Summary

Application           AMD 5870   AMD 7970   NVIDIA C1060   NVIDIA C2050
FloydWarshall         gmem       gmem       gmem           gmem
FastWalshTransform    gmem       gmem       gmem           gmem
MatrixTranspose       gmem       gmem       gmem           gmem
MatMul(global)        gmem       gmem       gmem           gmem
MatMul(local)         local      local      gmem           compute
Reduction             gmem       gmem       gmem           compute
NBody                 compute    compute    compute        compute
AESEncryptDecrypt     local      compute    compute        compute

Page 27: Online Performance Projection for Clusters with Heterogeneous GPUs

Accuracy of Performance Projection

[Figure: actual and projected execution times (log scale, 0.1 to 100) on the C1060, C2050, HD 5870 and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem), Nbody, AESEncryptDecr., Reduction, MatrixMultiply (Lmem) and MatrixTranspose]

Best device:

Application            Actual   Projected
Fast Walsh             7970     7970
Floyd Warshall         7970     7970
Matmul (global)        5870     5870
Nbody                  7970     7970
AES Encrypt Decrypt    2050     7970
Reduction              7970     7970
Matmul (local)         7970     7970
Mat Transpose          2050     2050
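As a quick check on the best-device table above, the sketch below counts how often the projected best device matches the actual one. The device labels are copied from the table in application order; the function name `count_agreements` and the idea of summarizing the table as an agreement count are assumptions added for illustration, not a metric reported in the talk.

```c
#include <stddef.h>
#include <string.h>

/* Best device per application, copied from the table above. Order:
 * Fast Walsh, Floyd Warshall, Matmul (global), Nbody,
 * AES Encrypt Decrypt, Reduction, Matmul (local), Mat Transpose. */
static const char *const actual_best[] = {
    "7970", "7970", "5870", "7970", "2050", "7970", "7970", "2050"
};
static const char *const projected_best[] = {
    "7970", "7970", "5870", "7970", "7970", "7970", "7970", "2050"
};

/* Count the applications for which both lists name the same device. */
size_t count_agreements(const char *const *a, const char *const *b, size_t n)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(a[i], b[i]) == 0) {
            ++matches;
        }
    }
    return matches;
}
```

For the eight applications listed, the two rows agree on seven; the only mismatch is AES Encrypt Decrypt, where the C2050 is actually best but the HD 7970 is projected.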

Page 29: Online Performance Projection for Clusters with Heterogeneous GPUs

Emulation Overhead – Reduction Kernel

[Figure: kernel emulation time (s, log scale from 0.01 to 100) against data size (65536 to 1048576) for full kernel emulation and single workgroup emulation on the C2050 and HD 7970]

Page 31: Online Performance Projection for Clusters with Heterogeneous GPUs

90/10 Paradigm -> 10x10 Paradigm

• Simplified and specialized tools (“accelerators”) customized for different purposes (“applications”)
  – Narrower focus on applications (10%)
  – Simplified and specialized accelerators for each classification
• Why?
  – 10x lower power, 10x faster -> 100x more energy efficient

Figure credit: A. Chien, Salishan Conference 2010

Page 32: Online Performance Projection for Clusters with Heterogeneous GPUs

Conclusion

• We presented a “mini-emulation” technique for online workload characterization of OpenCL kernels
  – The approach is shown to be sufficiently accurate for relative performance projection
  – The approach has asymptotically lower overhead than projection using full kernel emulation
• Our technique is shown to work well with multiple architectural families of AMD and NVIDIA GPUs
• With the increasing diversity in accelerators (towards 10x10*), our methodology only becomes more relevant.

*S. Borkar and A. Chien, “The future of microprocessors,” Communications of the ACM, 2011

Page 33: Online Performance Projection for Clusters with Heterogeneous GPUs

Thank You

Page 34: Online Performance Projection for Clusters with Heterogeneous GPUs

Backup

Page 35: Online Performance Projection for Clusters with Heterogeneous GPUs

Evolution of Microprocessors: 90/10 Paradigm

• Derive common cases for applications (90%)
  – Broad focus on application workloads
• Architectural improvements for 90% of cases
  – Design an aggregated generic “core”
  – Less customizability for applications

Figure credit: A. Chien, Salishan Conference 2010

Page 37: Online Performance Projection for Clusters with Heterogeneous GPUs

Application Boundedness: NVIDIA GPUs

[Figure: projected time (normalized, log scale) for compute, global memory (Gmem) and local memory (Lmem) on the C1060 and C2050, for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem only), Nbody, AESEncryptDecrypt, Reduction, MatrixMultiply (Lmem) and MatrixTranspose; each application is labeled with its limiting component (gmem, lmem or compute)]

Page 38: Online Performance Projection for Clusters with Heterogeneous GPUs

Evaluation: Projection Accuracy (Relative to C1060)

[Figure: relative execution time and projected relative execution time (both log scale, 0.1 to 100), plus relative error (%, 0 to 100), on the C1060, C2050, HD 5870 and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem), Nbody, AESEncryptDecr., Reduction, MatrixMultiply (Lmem) and MatrixTranspose]

Page 39: Online Performance Projection for Clusters with Heterogeneous GPUs

Evaluation: Projection Overhead vs. Actual Kernel Execution of Matrix Multiplication

[Figure: kernel execution time (s, log scale from 0.0001 to 100) against data size (x = y = z, 64 to 8192) for actual device execution and single workgroup emulation on the C2050]

Page 40: Online Performance Projection for Clusters with Heterogeneous GPUs

Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Matrix Multiplication

[Figure: kernel emulation time (s, log scale from 0.1 to 10000) against data size (x = y = z, 64 to 1024) for full kernel emulation and single workgroup emulation on the C2050 and HD 7970]

Page 41: Online Performance Projection for Clusters with Heterogeneous GPUs

Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Reduction

[Figure: kernel emulation time (s, log scale from 0.01 to 100) against data size (65536 to 1048576) for full kernel emulation (C2050), single workgroup emulation (C2050) and full kernel emulation (HD 7970)]