Workshop on HPC in India
Programming Models, Languages, and Compilation for
Accelerator-Based Architectures
R. Govindarajan
SERC, [email protected]
ATIP 1st Workshop on HPC in India @ SC-09
Current Trend in HPC Systems
- Top500 systems have hundreds of thousands (100,000s) of cores; performance scaling on such large HPC systems is a major challenge.
- The number of cores per processor/node keeps increasing: 4-6 cores per processor, 16-24 cores per node, so there is parallelism even at the node level.
- Top systems use accelerators (GPUs and Cell BEs); a single GPU contains 1000s of processing elements.
HPC Design Using Accelerators
- High level of performance from accelerators.
- Variety of general-purpose hardware accelerators: GPUs (NVIDIA, ATI), ClearSpeed, Cell BE, ...; a plethora of instruction sets even for SIMD.
- Programmable accelerators, e.g., FPGA-based HPC.
- HPC design using accelerators exploits instruction-level parallelism, data-level parallelism on SIMD units, and thread-level parallelism on multiple units/multi-cores.
- Challenges: portability across different generations and platforms; ability to exploit different types of parallelism.
Accelerators – Cell BE
Accelerators - 8800 GPU
The Challenge
Many device-specific programming interfaces:
- SSE
- CUDA
- OpenCL
- ARM Neon
- AltiVec
- AMD CAL
Programming in Accelerator-Based Architectures
Develop a framework that:
- is programmed in a higher-level language, and is efficient;
- can exploit different types of parallelism on different hardware, including parallelism across heterogeneous functional units;
- is portable across platforms, i.e., not device specific.
Existing Approaches
- C/C++: an autovectorizer targets SSE/AltiVec on the CPU.
- CUDA/OpenCL: the nvcc/JIT compiler targets CPUs and GPUs via PTX / ATI CAL IL.
- Brook: the Brook compiler targets CPUs and GPUs via ATI CAL IL.
Existing Approaches (contd.)
- StreamIt: the StreamIt compiler targets Cell BE and RAW.
- Accelerator: the DirectX runtime targets CPUs and GPUs.
- OpenMP: a standard compiler targets CPUs and GPUs.
What is needed?
Synergistic execution on multiple heterogeneous cores: a compiler/runtime system that maps many programming models (streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages such as Matlab, and other parallel languages) onto many targets (Cell BE, other accelerators, multicores, GPUs, SSE).
What is needed?
The same picture with the missing middle filled in: the source programming models (streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages such as Matlab, other parallel languages) are lowered to PLASMA, a high-level IR, which a compiler and runtime system then map onto the heterogeneous targets (Cell BE, other accelerators, multicores, GPUs, SSE) for synergistic execution.
Stream Programming Model
- A higher-level programming model in which nodes represent computation and channels represent communication (a producer/consumer relation) between them.
- Exposes pipelined parallelism and task-level parallelism, with temporal streaming of data.
- Examples: Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, ...
- Compiling techniques exist for achieving rate-optimal, buffer-optimal, software-pipelined schedules.
- Goal: mapping applications to accelerators such as GPUs and Cell BE.
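The SDF scheduling idea above can be made concrete with a small sketch: in a chain of filters, the steady-state repetition counts r satisfy the balance equations r[i] * push[i] == r[i+1] * pop[i+1]. The helper below is purely illustrative (it is not part of the compiler described in the talk) and handles only a linear chain:

```python
from fractions import Fraction
from math import gcd

def repetitions(chain):
    """Smallest integer steady-state repetition counts for a chain of
    SDF filters. chain: list of (pop_rate, push_rate), filter 0 = source.
    Solves r[i] * push[i] == r[i+1] * pop[i+1]."""
    r = [Fraction(1)]
    for i in range(1, len(chain)):
        push_prev = chain[i - 1][1]
        pop_cur = chain[i][0]
        r.append(r[-1] * push_prev / pop_cur)
    # Scale the rational solution to the smallest positive integers.
    denom_lcm = 1
    for x in r:
        denom_lcm = denom_lcm * x.denominator // gcd(denom_lcm, x.denominator)
    ints = [int(x * denom_lcm) for x in r]
    g = 0
    for v in ints:
        g = gcd(g, v)
    return [v // g for v in ints]
```

For example, a source pushing 1 value per firing feeding a filter that pops 2 (like the SAXPY filter later in the talk) needs the source to fire twice per steady state.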
The StreamIt Language
StreamIt programs are a hierarchical composition of three basic constructs:
- Pipeline: a linear chain of filters.
- SplitJoin: a splitter (round-robin or duplicate), parallel streams, and a joiner.
- FeedbackLoop: a joiner, a body, a splitter, and a loop stream.
Filters may be stateful and may peek at values beyond those they pop.
Why StreamIt on GPUs?
- More "natural" than frameworks like CUDA or CTM, with an easier learning curve: no need to think in terms of "threads" or blocks.
- StreamIt programs are easier to verify.
- The schedule can be determined statically.
Issues in Mapping StreamIt to GPUs
- Work distribution across multiprocessors: GPUs have hundreds of processing pipes, so exploit task-level and data-level parallelism and schedule across the multiprocessors, with multiple concurrent threads per SM to exploit DLP.
- Choosing the execution configuration: task granularity and concurrency.
- Lack of synchronization between the processors of the GPU.
- Managing CPU-GPU memory bandwidth.
Stream Graph Execution
[Figure: a stream graph with filters A, B, C, and D, and its software-pipelined execution on SMs SM1-SM4 over time steps 0-7. Data-parallel instances (A1-A4, B1-B4, C1-C4, D1-D4) run across the SMs, illustrating pipeline parallelism, task parallelism, and data parallelism.]
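The overlap in the figure above can be sketched in a few lines. Assuming (for illustration only) that each filter is one pipeline stage and iteration i of stage s runs at time step i + s, the steady state executes different iterations of different stages concurrently:

```python
def software_pipeline(stages, iterations):
    """Return, per time step, the (stage, iteration) pairs that execute in a
    software-pipelined schedule of a linear stream graph, where iteration i
    of stage s runs at step i + s (a simplifying assumption)."""
    steps = {}
    for s, stage in enumerate(stages):
        for i in range(iterations):
            steps.setdefault(i + s, []).append((stage, i))
    return [steps[t] for t in sorted(steps)]
```

With two stages and two iterations, the middle step runs A's second iteration alongside B's first, which is exactly the overlap the figure depicts.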
Our Approach for GPUs
Code for SAXPY:

float->float filter saxpy {
    float a = 2.5f;
    work pop 2 push 1 {
        float x = pop();
        float y = pop();
        float s = a * x + y;
        push(s);
    }
}
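Each firing of the filter above pops two values and pushes one. A plain Python rendering of the same work-function semantics (for illustration only; it says nothing about how the generated CUDA code executes):

```python
def saxpy_filter(stream, a=2.5):
    """Fire the SAXPY work function repeatedly over an input stream:
    each firing pops x and y and pushes a*x + y.
    len(stream) must be even (pop rate 2)."""
    out = []
    it = iter(stream)
    for x in it:
        y = next(it)           # pop 2: x, then y
        out.append(a * x + y)  # push 1: a*x + y
    return out
```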
Our Approach (contd.)
- Multithreading: identify a good execution configuration to exploit the right amount of data parallelism.
- Memory: an efficient buffer layout scheme ensures all accesses to GPU memory are coalesced.
- Task partitioning between GPU and CPU cores: a work-scheduling and processor (SM) assignment problem that takes communication bandwidth restrictions into account.
Execution Configuration
[Figure: 128 data-parallel instances (A0...A127 feeding B0...B127) grouped into macro nodes, with macro-node execution times of 32 and 16 in the two configurations shown. Total execution time on 2 SMs = MII = 64/2 = 32.]
More threads for exploiting data-level parallelism.
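The MII arithmetic in the figure is just steady-state work divided across the SMs. A one-line sketch (simplified: it captures only the resource bound and ignores DMA and dependence constraints):

```python
from math import ceil

def resource_mii(total_work, num_sms):
    """Resource-constrained minimum initiation interval: total steady-state
    work divided across the SMs, rounded up to a whole time step."""
    return ceil(total_work / num_sms)
```

This reproduces the figure's 64/2 = 32.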
Coalesced Memory Access
- GPUs have a banked memory architecture with a very wide memory channel.
- Accesses by threads in an SM have to be coalesced.
[Figure: with the natural layout d0 d1 d2 d3 d4 d5 d6 d7, each of threads 0-3 reads its own contiguous pair, so simultaneous accesses hit scattered addresses. After interleaving the buffer to d0 d2 d4 d6 d1 d3 d5 d7, at each step threads 0-3 read consecutive addresses across banks B0-B3, so the accesses coalesce.]
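The buffer transform implied by the figure can be sketched directly. Assuming each of T threads originally owns a contiguous chunk of k elements, moving element j of thread t to position j*T + t makes each simultaneous access step touch consecutive addresses (helper name is hypothetical):

```python
def coalesced_layout(buf, num_threads):
    """Interleave a buffer so that at pop step j, thread t reads position
    j*num_threads + t, i.e., threads access consecutive addresses.
    Original layout: thread t owns the contiguous chunk buf[t*k:(t+1)*k]."""
    k = len(buf) // num_threads
    out = [None] * len(buf)
    for t in range(num_threads):
        for j in range(k):
            out[j * num_threads + t] = buf[t * k + j]
    return out
```

Applied to the figure's eight elements and four threads, this yields exactly the d0 d2 d4 d6 d1 d3 d5 d7 ordering.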
Execution on CPU and GPU
- Problem: partition work across the CPU and GPU; data transfer between GPU and host memory is required based on the partition.
- Coalesced access is efficient for the GPU but harmful for the CPU, so data must be transformed before moving from/to GPU memory.
- Objective: reduce the overall execution time, taking memory transfer and transform delays into account.
Scheduling and Mapping
[Figure: an initial StreamIt graph with nodes A-E, each annotated with its CPU and GPU execution costs (e.g., GPU:20, CPU:20, CPU:15, CPU:10) and edge transfer costs (20, 10, 10), and the partitioned graph in which each node is assigned to the CPU or the GPU. Resulting loads: CPU 45, GPU 40, DMA 40; MII = 45.]
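The actual system uses ILP and heuristic partitioners (see the compiler framework slide). Purely to illustrate the objective, here is a naive greedy sketch that assigns each node to whichever device runs it cheaper and reports the resulting loads; unlike the real partitioners it ignores DMA cost on cut edges, and the node costs in the test are illustrative, not the figure's:

```python
def greedy_partition(costs):
    """costs: {node: (cpu_time, gpu_time)}. Assign each node to its cheaper
    device; return (assignment, cpu_load, gpu_load). A toy heuristic that
    ignores DMA transfer cost between devices."""
    assign, cpu_load, gpu_load = {}, 0, 0
    for node, (c, g) in costs.items():
        if c <= g:
            assign[node] = "CPU"
            cpu_load += c
        else:
            assign[node] = "GPU"
            gpu_load += g
    return assign, cpu_load, gpu_load
```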
Scheduling and Mapping (contd.)
[Figure: the software-pipelined steady state of the partitioned graph. Instances from different steady-state iterations (e.g., An, Bn-1, Cn-3, Dn-5, En-7) execute concurrently in the CPU, DMA channel, and GPU columns of the schedule.]
Compiler Framework
1. StreamIt program
2. Generate code for profiling; execute profile runs
3. Configuration selection
4. Task partitioning (ILP partitioner or heuristic partitioner)
5. Instance partitioning
6. Modulo scheduling
7. Code generation: CUDA code + C code
Experimental Results on Tesla
Significant speedup for synergistic execution.
[Figure: bar chart of speedups on Tesla.]
What is needed?
Recap: source programming models (streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages such as Matlab, other parallel languages) are lowered to PLASMA, the high-level IR, which the compiler and runtime system map onto the heterogeneous targets (Cell BE, other accelerators, multicores, GPUs, SSE) for synergistic execution.
PLASMA IR: What Should a Solution Provide?
- Rich abstractions for functionality.
- Independence from any single architecture; portability without compromises on efficiency.
- Scale-up and scale-down: from a single-core embedded processor to a multi-core workstation.
- The ability to take advantage of accelerators (GPU, Cell, ...).
- Transparent distributed memory.
PLASMA: Portable Programming for PLASTIC SIMD Accelerators.
PLASMA IR
Matrix-vector multiply (matrix M, vector V) expressed as a parallel multiply over a slice, followed by a reduction:

par mul, temp, A[i*n : i*n+n : 1], X
reduce add, Y[i : i+1 : 1], temp
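The two IR operations can be read as an elementwise parallel multiply followed by an additive reduction, applied once per row. A Python rendering of those semantics (hypothetical helper mirroring the IR, not the real compiler; A is the matrix flattened row-major):

```python
def matvec(A, X, n):
    """Matrix-vector multiply in the style of the PLASMA IR above:
    for each row i, 'par mul' forms temp = A[i*n : i*n+n] * X elementwise,
    then 'reduce add' accumulates temp into Y[i]."""
    m = len(A) // n
    Y = [0] * m
    for i in range(m):
        temp = [A[i * n + j] * X[j] for j in range(n)]  # par mul
        Y[i] = sum(temp)                                # reduce add
    return Y
```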
Our Framework
- "CPLASM", a prototype high-level assembly language.
- A prototype PLASMA IR compiler.
- Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs).
- Future targets: Cell, ATI, ARM Neon, ...
- Compiler optimizations for this "vector" IR.
Our Framework (contd.)
Plenty of optimization opportunities!
PLASMA IR Performance
Normalized execution time is comparable to that of hand-tuned libraries!
Ongoing Work
The same framework picture: streaming languages, MPI/OpenMP, CUDA/OpenCL, array languages (Matlab), and other parallel languages, lowered through the PLASMA high-level IR, compiler, and runtime system onto Cell BE, other accelerators, multicores, GPUs, and SSE for synergistic execution.
- Look at other high-level languages!
- Target other accelerators.
Compiling OpenMP/MPI/X10
- Mapping the semantics; exploiting data parallelism and task parallelism.
- Communication and synchronization across CPU/GPU/multiple nodes.
- Accelerator-specific optimization: memory layout, memory transfer, ...
- Performance and scaling.
Acknowledgements
My students! IISc and SERC, Microsoft and NVIDIA, ATIP, NSF, ONR, and all sponsors.
Thank You!!