Workshop on HPC in India

Programming Models, Languages, and Compilation for Accelerator-Based Architectures

R. Govindarajan
SERC, IISc
govind@serc.iisc.ernet.in

ATIP 1st Workshop on HPC in India @ SC-09

Current Trend in HPC Systems

• Top500 systems have hundreds of thousands (100,000s) of cores
  • Large HPC systems: performance scaling is a major challenge
• The number of cores in a processor/node is increasing!
  • 4–6 cores per processor, 16–24 cores per node
  • Parallelism even at the node level
• Top systems use accelerators
  • GPUs and Cell BEs
  • 1000s of cores/processing elements in a single GPU!

HPC Design Using Accelerators

• High level of performance from accelerators
• Variety of general-purpose hardware accelerators
  • GPUs: NVIDIA, ATI, …
  • Accelerators: ClearSpeed, Cell BE, …
  • Plethora of instruction sets, even for SIMD
• Programmable accelerators, e.g., FPGA-based
• HPC design using accelerators
  • Exploit instruction-level parallelism
  • Exploit data-level parallelism on SIMD units
  • Exploit thread-level parallelism on multiple units/multi-cores
• Challenges
  • Portability across different generations and platforms
  • Ability to exploit different types of parallelism

Accelerators – Cell BE

Accelerators - 8800 GPU

The Challenge

[Figure: the many programming interfaces a developer must target: SSE, CUDA, OpenCL, ARM Neon, AltiVec, AMD CAL]

Programming in Accelerator-Based Architectures

• Develop a framework that
  • is programmed in a higher-level language, and is efficient
  • can exploit different types of parallelism on different hardware
  • exploits parallelism across heterogeneous functional units
  • is portable across platforms, not device specific!

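Existing Approaches

[Figure: three existing toolchains]
• C/C++: Autovectorizer, emitting SSE/AltiVec; target CPU
• CUDA/OpenCL: Compiler (nvcc/JIT), emitting PTX/ATI CAL IL; targets CPU, GPUs
• Brook: Brook Compiler, emitting ATI CAL IL; targets CPU, GPUs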

Existing Approaches (contd.)

[Figure: three more existing toolchains]
• StreamIt: StreamIt Compiler; targets Cell BE, RAW
• Accelerator: Std. Compiler and Runtime (DirectX); targets CPU, GPUs
• OpenMP: Std. Compiler; targets CPU, GPUs

What is needed?

[Figure: Synergistic Execution on Multiple Heterogeneous Cores]
• Source languages: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), Parallel Lang.
• Middle layer: Compiler/Runtime System
• Targets: Cell BE, Other Accel., Multicores, GPUs, SSE

What is needed?

[Figure: Synergistic Execution on Multiple Heterogeneous Cores]
• Source languages: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), Parallel Lang.
• Middle layer: PLASMA: High-Level IR, Compiler, Runtime System
• Targets: Cell BE, Other Accel., Multicores, GPUs, SSE

Stream Programming Model

• Higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relation) between them
• Exposes pipelined parallelism and task-level parallelism
• Temporal streaming of data
• Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
• Mapping applications to accelerators such as GPUs and Cell BE
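A static schedule for a synchronous data flow graph starts from the steady-state firing counts that keep every channel balanced. The sketch below is only an illustration of that idea for a linear pipeline; the function name `repetition_vector` and the example rates are hypothetical and do not come from the talk.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(rates):
    """Smallest integer firing counts that balance a linear SDF pipeline.

    rates: list of (pop, push) pairs, one per filter, in pipeline order.
    For each channel, producer_firings * push must equal consumer_firings * pop.
    """
    reps = [Fraction(1)]  # fire the first filter once, then scale up
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push_prev / pop_next)
    # Multiply by the least common multiple of the denominators to get integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps), 1)
    return [int(r * scale) for r in reps]


# Hypothetical pipeline: source pushes 2, middle pops 3 / pushes 1, sink pops 2.
print(repetition_vector([(0, 2), (3, 1), (2, 0)]))
```

With these assumed rates the source fires three times, the middle filter twice, and the sink once per steady-state iteration, so each channel carries as many items as it drains.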

The StreamIt Language

• StreamIt programs are a hierarchical composition of three basic constructs:
  • Pipeline
  • SplitJoin (round-robin or duplicate splitter)
  • Feedback Loop
• Stateful filters
• Peek values

[Figure: Pipeline (Filter, Filter, ..., Filter); SplitJoin (Splitter, parallel Streams, Joiner); Feedback Loop (Joiner, Body, Splitter, with Loop on the back edge)]

Why StreamIt on GPUs

• More "natural" than frameworks like CUDA or CTM
• Easier learning curve than CUDA
  • No need to think of "threads" or blocks
• StreamIt programs are easier to verify
• Schedule can be determined statically

Issues in Mapping StreamIt to GPUs

• Work distribution across multiprocessors
  • GPUs have hundreds of processing pipes!
  • Exploit task-level and data-level parallelism
  • Schedule across the multiprocessors
  • Multiple concurrent threads in an SM to exploit DLP
• Execution configuration: task granularity and concurrency
• Lack of synchronization between the processors of the GPU
• Managing CPU-GPU memory bandwidth

Stream Graph Execution

[Figure: a stream graph with filters A, B, C, D, and its software-pipelined execution on SM1 to SM4, with instances A1–A4, B1–B4, C1–C4, D1–D4 placed over time steps 0–7. The schedule exploits pipeline parallelism, task parallelism, and data parallelism.]

Our Approach for GPUs

Code for SAXPY:

float->float filter saxpy {
    float a = 2.5f;            // scalar multiplier

    // Each firing consumes two items and produces one.
    work pop 2 push 1 {
        float x = pop();
        float y = pop();
        float s = a * x + y;
        push(s);
    }
}
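To make the pop/push rates of the filter concrete, here is a rough Python rendering of what one would expect it to compute on a finite input. It is only a sketch: the function name `saxpy_filter` and the use of a `deque` as the input channel are illustrative choices, not part of StreamIt or of the compiler described in the talk.

```python
from collections import deque


def saxpy_filter(stream, a=2.5):
    """Emulate the saxpy filter: each firing pops 2 items and pushes 1."""
    fifo = deque(stream)      # stands in for the input channel
    out = []                  # stands in for the output channel
    while len(fifo) >= 2:     # fire only when a full pop window is available
        x = fifo.popleft()
        y = fifo.popleft()
        out.append(a * x + y)
    return out


print(saxpy_filter([1.0, 2.0, 3.0, 4.0]))  # two firings: [4.5, 11.5]
```

Because every firing touches a disjoint pair of inputs and the filter keeps no state between firings, the firings are independent of one another, which is what lets many instances run as parallel GPU threads.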

Our Approach (contd.)

• Multithreading
  • Identify a good execution configuration to exploit the right amount of data parallelism
• Memory
  • Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
• Task partition between GPU and CPU cores
  • Work scheduling and processor (SM) assignment problem
  • Takes into account communication bandwidth restrictions

Execution Configuration

[Figure: macro nodes made of 128 thread instances each (A0, A1, …, A127; B0, B1, …, B127; B0, B1, …, B127)]
• Exec. time of macro node = 32
• Exec. time of macro node = 16
• Total exec. time on 2 SMs = MII = 64/2 = 32
• More threads for exploiting data-level parallelism
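The MII figure above matches the usual resource-constrained bound from modulo scheduling: total work divided by the number of processors, rounded up. The sketch below assumes that the total of 64 comes from one macro node of time 32 plus two macro nodes of time 16; that split is a reading of the figure, not something the slide states, and `resource_mii` is a hypothetical helper name.

```python
import math


def resource_mii(exec_times, num_sms):
    """Resource-constrained minimum initiation interval.

    exec_times: execution time of every macro-node firing in one steady state.
    num_sms:    number of streaming multiprocessors sharing the work.
    """
    return math.ceil(sum(exec_times) / num_sms)


# Assumed workload: one firing of time 32 and two firings of time 16.
print(resource_mii([32, 16, 16], num_sms=2))  # 64 / 2 = 32
```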

Coalesced Memory Accessing

• GPUs have a banked memory architecture with a very wide memory channel
• Accesses by threads in an SM have to be coalesced

[Figure: two buffer layouts over banks B0 B1 B2 B3 B0 B1 B2 B3, accessed by thread0 to thread3]
• Layout 1: d0 d1 d2 d3 d4 d5 d6 d7
• Layout 2: d0 d2 d4 d6 d1 d3 d5 d7
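The second layout in the figure can be obtained from the first by grouping together the k-th item that each thread pops. The sketch below assumes four threads that each pop two consecutive items (thread t consumes d(2t) and d(2t+1)); that assignment is inferred from the figure, and the names `coalesced_layout` and `coalesced_index` are hypothetical.

```python
def coalesced_layout(data, pop_rate):
    """Reorder a buffer so the k-th pop of all threads is contiguous."""
    num_threads = len(data) // pop_rate
    return [data[t * pop_rate + k]
            for k in range(pop_rate)
            for t in range(num_threads)]


def coalesced_index(thread, k, num_threads):
    """Position of the k-th item popped by `thread` in the reordered buffer."""
    return k * num_threads + thread


items = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]
print(coalesced_layout(items, pop_rate=2))
# ['d0', 'd2', 'd4', 'd6', 'd1', 'd3', 'd5', 'd7']
```

Under this assumption, the first pop of threads 0 to 3 reads four adjacent locations, one per bank, instead of striding across the buffer.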

Execution on CPU and GPU

• Problem: partition work across CPU and GPU
  • Data transfer between GPU and host memory is required, based on the partition!
  • Coalesced access is efficient for the GPU, but harmful for the CPU!
  • Transform data before moving it from/to GPU memory
• Reduce the overall execution time, taking into account memory transfer and transform delays!

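Scheduling and Mapping

[Figure: initial StreamIt graph and partitioned graph, each with filters A, B, C, D, E]

Initial StreamIt graph
• Filter costs: CPU:10 GPU:20; CPU:20; CPU:80 GPU:20; CPU:15 GPU:10; CPU:10 GPU:25
• Edge costs: 20, 10, 10, 60

Partitioned graph
• Filter assignments: GPU:20, CPU:20, GPU:20, CPU:15, CPU:10
• Edge costs: 20, 10, 10
• CPU load: 45, GPU load: 40, DMA load: 40, MII: 45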
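The load figures for the partitioned graph (CPU 45, GPU 40, DMA 40, MII 45) are consistent with summing the costs assigned to each resource and taking the maximum as the initiation interval. The sketch below reproduces that arithmetic; treating the three edge costs as the DMA transfers is an assumption, and `partition_loads` is a hypothetical helper, not the partitioner from the talk.

```python
def partition_loads(cpu_costs, gpu_costs, dma_costs):
    """Per-resource load and the resulting bound on the initiation interval."""
    loads = {
        "CPU": sum(cpu_costs),
        "GPU": sum(gpu_costs),
        "DMA": sum(dma_costs),
    }
    # The busiest resource limits how often a new iteration can start.
    return loads, max(loads.values())


loads, mii = partition_loads(
    cpu_costs=[20, 15, 10],   # filters placed on the CPU
    gpu_costs=[20, 20],       # filters placed on the GPU
    dma_costs=[20, 10, 10],   # assumed CPU<->GPU transfers
)
print(loads, mii)  # {'CPU': 45, 'GPU': 40, 'DMA': 40} 45
```

A partitioner would search over assignments to minimize this maximum, which is why an edge with a large transfer cost tends to keep both of its endpoints on the same device.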

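Scheduling and Mapping

[Figure: software-pipelined schedule of the partitioned graph (filters A, B, C, D, E with GPU:20, CPU:20, GPU:20, CPU:15, CPU:10; edge costs 20, 10, 10) across three resources: CPU, DMA Channel, GPU. In the steady state, instances from different iterations overlap: An, An-1, Bn-1, Bn-2, Bn-3, Cn-3, Cn-4, Cn-5, Dn-5, Dn-6, En-7.]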

Compiler Framework

[Figure: compilation flow]
• StreamIt Program
• Generate Code for Profiling
• Execute Profile Runs
• Configuration Selection
• Task Partitioning (ILP Partitioner or Heuristic Partitioner)
• Instance Partitioning
• Modulo Scheduling
• Code Generation
• CUDA Code + C Code

Experimental Results on Tesla

• Significant speedup for synergistic execution

[Figure: speedup chart with annotations > 52x, > 32x, > 65x]

What is needed?

[Figure: Synergistic Execution on Multiple Heterogeneous Cores]
• Source languages: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), Parallel Lang.
• Middle layer: PLASMA: High-Level IR, Compiler, Runtime System
• Targets: Cell BE, Other Accel., Multicores, GPUs, SSE

IR: What should a solution provide?

• Rich abstractions for functionality
• Independence from any single architecture
• Portability without compromises on efficiency
• Scale-up and scale-down
  • Single-core embedded processor to multi-core workstation
• Take advantage of accelerators (GPU, Cell, …)
• Transparent distributed memory

PLASMA: Portable Programming for PLASTIC SIMD Accelerators

PLASMA IR

Matrix-Vector Multiply

[Figure: a Slice of M, together with V, feeds Par Mul; its result feeds Reduce Add]

par mul, temp, A[i*n : i*n+n : 1], X

reduce add, Y[i : i+1 : 1], temp
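One way to read the two IR statements is as the body of a loop over rows: slice row i out of a flat, row-major matrix, multiply it elementwise with the vector, then sum the products into one output element. The Python below is a sketch of that reading only; the loop over i, the row-major storage, and the helper names `par_mul`, `reduce_add`, and `matvec` are assumptions rather than PLASMA semantics confirmed by the slide.

```python
def par_mul(a, b):
    """Elementwise product of two equal-length sequences."""
    return [p * q for p, q in zip(a, b)]


def reduce_add(values):
    """Sum reduction over a sequence."""
    total = 0
    for v in values:
        total += v
    return total


def matvec(A, X, n):
    """Multiply a flat row-major matrix A (rows of length n) by vector X."""
    rows = len(A) // n
    Y = [0] * rows
    for i in range(rows):
        temp = par_mul(A[i * n : i * n + n : 1], X)   # par mul, temp, A[...], X
        Y[i : i + 1 : 1] = [reduce_add(temp)]         # reduce add, Y[...], temp
    return Y


print(matvec([1, 2, 3, 4], [1, 1], n=2))  # [3, 7]
```

The start:stop:stride slices mirror the IR operands directly, which hints at how the same description could be lowered to scalar C, SSE, or CUDA.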

Our Framework

• "CPLASM", a prototype high-level assembly language
• Prototype PLASMA IR compiler
• Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
• Future targets: Cell, ATI, ARM Neon, ...
• Compiler optimizations for this "vector" IR

Our Framework (contd.)

Plenty of optimization opportunities!

PLASMA IR Performance

Normalized exec. time is comparable to that of a hand-tuned library!

Ongoing Work

[Figure: Synergistic Execution on Multiple Heterogeneous Cores]
• Source languages: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), Parallel Lang.
• Middle layer: PLASMA: High-Level IR, Compiler, Runtime System
• Targets: Cell BE, Other Accel., Multicores, GPUs, SSE
• Look at other high-level languages!
• Target other accelerators

• Compiling OpenMP/MPI/X10
  • Mapping the semantics
  • Exploiting data parallelism and task parallelism
  • Communication and synchronization across CPU/GPU/multiple nodes
• Accelerator-specific optimization
  • Memory layout, memory transfer, …
• Performance and scaling

Acknowledgements

• My students!
• IISc and SERC
• Microsoft and Nvidia
• ATIP, NSF, all sponsors
• ONR

Thank You !!