Online Performance Projection for Clusters with Heterogeneous GPUs
Lokendra S. Panwar, Ashwin M. Aji, Wu-chun Feng (Virginia Tech, USA)
Jiayuan Meng, Pavan Balaji (Argonne National Laboratory, USA)
Diversity in Accelerators
[Figure: performance share of accelerators in Top500 systems, Nov 2008 vs. Nov 2013. Source: top500.org]
Heterogeneity “Among” Nodes
• Clusters are deploying different accelerators
  – Different accelerators for different tasks
• Example clusters:
  – “Shadowfax” at VBI@VT: NVIDIA GPUs, FPGAs
  – “Darwin” at LANL: NVIDIA GPUs, AMD GPUs
  – “Dirac” at NERSC: NVIDIA Tesla and Fermi GPUs
• However… a unified programming model covers “all” accelerators: OpenCL
  – CPUs, GPUs, FPGAs, DSPs
Affinity of Tasks to Processors
• Peak performance doesn’t necessarily translate into actual device performance.
Reduction kernel   Peak GFLOPs   Global Memory BW (GB/s)   Actual Time (ms)
NVIDIA C2050       1030          144                       0.13
AMD HD5870         2720          154                       0.21
int main() {
  cl_int error;
  cl_platform_id platform;
  cl_device_id device;
  cl_uint platforms, devices;
  // Fetch the platform and device IDs
  error = clGetPlatformIDs(1, &platform, &platforms);
  error = clGetDeviceIDs(platform, .., &devices);
  cq = clCreateCommandQueue(context, .., &error);
  prog = clCreateProgramWithSource(context, 1, srcptr, &srcsize, &error);
  error = clBuildProgram(prog, 0, NULL, "", NULL, NULL);
  // Perform the operation
  error = clEnqueueNDRangeKernel(cq, .., NULL);
  // Read the result
  error = clEnqueueReadBuffer(cq, .., NULL);
  // Await completion of all the above
  error = clFinish(cq);
}
Given an OpenCL program like the one above, which of the devices in the table should run it?
Challenges for Runtime Systems
• Heterogeneous runtime systems must embrace the different accelerators in a cluster with respect to both performance and power
• Examples of OpenCL runtime systems:
  – SnuCL
  – VOCL
  – SOCL
• Challenges:
  – Efficiently choose the right device for the right task
  – Keep the decision-making overhead minimal
Our Contributions
• An online workload characterization technique for OpenCL kernels
• Our model projects the relative ranking of different devices with little overhead
• An end-to-end evaluation of our technique for multiple architectural families of AMD and NVIDIA GPUs
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Design
• Goal: rank accelerators for a given OpenCL workload
  – Accurately AND efficiently: decision making with minimal overhead
• Choices:
  – Static code analysis:
    • Fast
    • Inaccurate, as it does not account for dynamic properties: input data dependence, memory access patterns, dynamic instruction counts
  – Dynamic code analysis:
    • Higher accuracy
    • Execute either on the actual device or through an emulator
      – Not always feasible to run on actual devices: data transfer costs, and clusters are “busy”
      – Emulators are very slow
Design – Workload Profiling
[Diagram: an OpenCL kernel is run through an emulator to obtain memory access patterns, bank conflicts, and the instruction mix]
Design – Workload Profiling
• “Mini-emulation”: emulate a single workgroup
• Collect dynamic characteristics:
  – Instruction traces
  – Global and local memory transactions and access patterns
• In typical data-parallel workloads, workgroups exhibit similar runtime characteristics
  – Asymptotically lower overhead than emulating the full kernel (see the sketch after the diagram below)
[Diagram: the OpenCL kernel is run through the mini-emulator to obtain memory access patterns, bank conflicts, and the instruction mix]
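To make the overhead argument concrete, here is a minimal C sketch of the idea. It assumes a functional emulator exposes a per-work-item replay hook; emulate_work_item() and the trace fields below are hypothetical stand-ins, not the actual Multi2Sim or GPGPU-Sim API.

/* Sketch (not the paper's implementation): characterize an OpenCL kernel
 * by emulating ONE workgroup instead of all of them. */
#include <stddef.h>

typedef struct {
    long alu_insns;        /* dynamic ALU/compute instructions       */
    long global_bytes;     /* global memory bytes (after coalescing) */
    long local_bytes;      /* local memory bytes                     */
    long bank_conflicts;   /* local memory bank conflicts            */
} trace_t;

/* Hypothetical emulator hook: replay one work-item of workgroup wg. */
trace_t emulate_work_item(const char *kernel_name, size_t wg, size_t tid);

typedef struct {
    long compute_insns, gmem_bytes, lmem_bytes, bank_conflicts;
} workload_profile;

workload_profile profile_one_workgroup(const char *kernel_name, size_t local_size)
{
    workload_profile p = {0, 0, 0, 0};
    /* Only workgroup 0 is emulated; in data-parallel kernels the remaining
     * workgroups behave almost identically, so the cost of profiling is
     * independent of the total problem size. */
    for (size_t tid = 0; tid < local_size; tid++) {
        trace_t t = emulate_work_item(kernel_name, 0, tid);
        p.compute_insns  += t.alu_insns;
        p.gmem_bytes     += t.global_bytes;
        p.lmem_bytes     += t.local_bytes;
        p.bank_conflicts += t.bank_conflicts;
    }
    return p;
}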
Design – Device Profiling
[Diagram: instruction and memory microbenchmarks are run on each GPU (GPU 1 … GPU N) to build device throughput profiles; each profile plots throughput (GB/s, log scale) against occupancy (1/32 to 1), e.g. for global memory reads]
Design – Device Profiling
• Build device throughput profiles:
  – Modified the SHOC microbenchmarks to:
    • Obtain hardware throughput with varying occupancy
    • Collect throughputs for instructions, global memory, and local memory
  – Built only once per device (a sketch of such a microbenchmark follows the figure below)
Global and local memory profile of the AMD 7970:
[Figure: global and local memory read/write throughput (GB/s, log scale) vs. occupancy (1/32 to 1)]
Design – Find Performance Limiter
[Diagram: the workload profile (instruction mix, memory access patterns, bank conflicts) is combined with the device throughput profile to project a time for compute, global memory, and local memory; the largest projected time is the performance bound]
Design – Find Performance Limiter
• Single-workgroup dynamic characteristics → full-kernel characteristics
  – Device occupancy is used as the scaling factor
• Compute projected theoretical times for:
  • Instructions
  • Global memory
  • Local memory
• GPUs aggressively try to hide the latencies of these components
• Performance limiter = max(t_local, t_global, t_compute)*
• Compare the normalized projected times across devices and choose the best one (a sketch of this step follows)
[Chart: projected times for compute, global memory, and local memory; the maximum of the three is the performance bound]
*Zhang et al., “A Quantitative Performance Analysis Model for GPU Architectures,” HPCA 2011
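A minimal C sketch of this projection step, assuming a workload profile already scaled to the full kernel and per-device throughputs looked up from the device profiles at the kernel's occupancy (the struct fields and function names are illustrative, not the paper's implementation):

#include <stddef.h>

typedef struct {                 /* from mini-emulation, scaled by the number of workgroups */
    double compute_insns, gmem_bytes, lmem_bytes;
} kernel_demand;

typedef struct {                 /* from the device profile, at the kernel's occupancy */
    double insn_throughput;      /* instructions per second */
    double gmem_bw, lmem_bw;     /* bytes per second        */
} device_rates;

/* Projected time is bounded by whichever component dominates,
 * following Zhang et al. (HPCA 2011): max(t_compute, t_gmem, t_lmem).
 * The values are relative estimates used only for ranking devices. */
static double projected_time(kernel_demand d, device_rates r)
{
    double t_compute = d.compute_insns / r.insn_throughput;
    double t_gmem    = d.gmem_bytes    / r.gmem_bw;
    double t_lmem    = d.lmem_bytes    / r.lmem_bw;
    double t = t_compute;
    if (t_gmem > t) t = t_gmem;
    if (t_lmem > t) t = t_lmem;
    return t;
}

/* Rank devices: the one with the smallest projected time wins. */
static int best_device(kernel_demand d, const device_rates *dev, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (projected_time(d, dev[i]) < projected_time(d, dev[best]))
            best = i;
    return best;
}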
Design
[Diagram, static profiling: instruction and memory microbenchmarks are run once on each GPU (GPU 1 … GPU N) to build device throughput profiles (throughput in GB/s vs. occupancy for global and local memory reads and writes)]
Design
[Diagram, dynamic profiling added: the GPU kernel is run through the mini-emulator (a single workgroup) to obtain memory access patterns, bank conflicts, and the instruction mix, alongside the statically built device profiles]
Design
[Diagram, complete workflow: the workload profile from the mini-emulator and the device throughput profiles feed a performance projection that identifies each device's performance limiter, yielding effective instruction throughput and effective global and local memory bandwidths and, from these, the relative performances of GPU 1 … GPU N]
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
Experimental Setup
• Accelerators:
  – AMD 7970: scalar ALUs, cache hierarchy
  – AMD 5870: VLIW ALUs
  – NVIDIA C2050: Fermi architecture, cache hierarchy
  – NVIDIA C1060: Tesla architecture
• Simulators:
  – Multi2Sim v4.1 for AMD devices and GPGPU-Sim v3.0 for NVIDIA devices
  – The methodology is agnostic to the specific emulator
• Applications:
Application          Input size
FloydWarshall        Num nodes = 192
FastWalshTransform   Array size = 1048576
MatrixMul (global)   Matrix size = [1024, 1024]
MatrixMul (local)    Matrix size = [1024, 1024]
Reduction            Array size = 1048576
NBody                Num particles = 32768
AESEncryptDecrypt    Width = 1536, Height = 512
MatrixTranspose      Matrix size = [1024, 1024]
Application Boundedness: AMD GPUs
[Figure: projected time (normalized, log scale) on the HD 5870 and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (gmem only), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (lmem), and MatrixTranspose, broken down into compute, global memory (gmem), and local memory (lmem) bounds, with each application's performance limiter labeled]
Application Boundedness Summary
Application          AMD 5870   AMD 7970   NVIDIA C1060   NVIDIA C2050
FloydWarshall        gmem       gmem       gmem           gmem
FastWalshTransform   gmem       gmem       gmem           gmem
MatrixTranspose      gmem       gmem       gmem           gmem
MatMul (global)      gmem       gmem       gmem           gmem
MatMul (local)       local      local      gmem           compute
Reduction            gmem       gmem       gmem           compute
NBody                compute    compute    compute        compute
AESEncryptDecrypt    local      compute    compute        compute
Accuracy of Performance Projection
[Figure: actual vs. projected execution times (normalized, log scale) on the C1060, C2050, HD 5870, and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (Lmem), and MatrixTranspose]
Best device per application (actual vs. projected):
Application          Actual   Projected
Fast Walsh           7970     7970
Floyd Warshall       7970     7970
MatMul (global)      5870     5870
NBody                7970     7970
AESEncryptDecrypt    2050     7970
Reduction            7970     7970
MatMul (local)       7970     7970
MatrixTranspose      2050     2050
Emulation Overhead – Reduction Kernel
[Figure: kernel emulation time (s, log scale) vs. data size (65536 to 1048576) for full-kernel emulation and single-workgroup emulation of the Reduction kernel on the C2050 and HD 7970]
Outline
• Introduction
• Motivation
• Contributions
• Design
• Evaluation
• Conclusion
90/10 Paradigm -> 10x10 Paradigm
• Simple, specialized tools (“accelerators”) customized for different purposes (“applications”)
  – Narrower focus on applications (10%)
  – Simplified and specialized accelerators for each classification
• Why?
  – 10x lower power, 10x faster -> 100x more energy efficient
Figure credit: A. Chien, Salishan Conference 2010
Conclusion
• We presented a “mini-emulation” technique for online workload characterization of OpenCL kernels
  – The approach is shown to be sufficiently accurate for relative performance projection
  – The approach has asymptotically lower overhead than projection using full-kernel emulation
• Our technique is shown to work well across multiple architectural families of AMD and NVIDIA GPUs
• With the increasing diversity in accelerators (towards 10x10*), our methodology only becomes more relevant
*S. Borkar and A. Chien, “The future of microprocessors,” Communications of the ACM, 2011
Evolution of Microprocessors: 90/10 Paradigm
• Derive common cases for applications (90%)
  – Broad focus on application workloads
• Architectural improvements for 90% of cases
  – Design an aggregated, generic “core”
  – Less customizability for applications
Figure credit: A. Chien, Salishan Conference 2010
Application Boundedness: NVIDIA GPUs
[Figure: projected time (normalized, log scale) on the C1060 and C2050 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (gmem only), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (lmem), and MatrixTranspose, broken down into compute, global memory (gmem), and local memory (lmem) bounds, with each application's performance limiter labeled]
Evaluation: Projection Accuracy (Relative to C1060)
[Figure: actual and projected execution times relative to the C1060 (log scale), and relative error (%), on the C1060, C2050, HD 5870, and HD 7970 for Fast Walsh Transform, Floyd Warshall, MatrixMultiply (Gmem), NBody, AESEncryptDecrypt, Reduction, MatrixMultiply (Lmem), and MatrixTranspose]
Evaluation: Projection Overhead vs. Actual Kernel Execution of Matrix Multiplication
[Figure: kernel execution time (s, log scale) vs. matrix size (64 to 8192, x = y = z) for actual device execution on the C2050 and single-workgroup emulation on the C2050]
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Matrix Multiplication
[Figure: kernel emulation time (s, log scale) vs. matrix size (64 to 1024, x = y = z) for full-kernel emulation and single-workgroup emulation on the C2050 and HD 7970]
Evaluation: Overhead of Mini-emulation vs. Full Kernel Emulation of Reduction
[Figure: kernel emulation time (s, log scale) vs. data size (65536 to 1048576) for full-kernel emulation and single-workgroup emulation of the Reduction kernel on the C2050 and HD 7970]