-
EECS 583 – Class 20
Research Topic 2: Stream
Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi
-
- 1 -
Announcements & Reading Material
Exams graded and will be returned in Wednesday’s class
This class
» “Orchestrating the Execution of Stream Programs on Multicore
Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN
2008 Conference on Programming Language Design and
Implementation, Jun. 2008.
Next class – Research Topic 3: Security
» “Dynamic taint analysis for automatic detection, analysis, and
signature generation of exploits on commodity software,” James
Newsome and Dawn Song, Proceedings of the Network and
Distributed System Security Symposium, Feb. 2005
-
Stream Graph Modulo Scheduling
-
- 3 -
Stream Graph Modulo Scheduling (SGMS)
Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs
Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
» DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell processor block diagram: eight SPEs (SPE0 through SPE7), each an SPU with a 256 KB local store and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM]
-
- 4 -
Preliminaries
Synchronous Data Flow (SDF) [Lee ’87]
StreamIt [Thies ’02]

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    pop();
    push(sum);
  }
}
-
- 5 -
SGMS Overview
[Figure: SGMS overview: a single-PE schedule of length T1 is software-pipelined across PE0 through PE3 into stages of length T4, with T1/T4 ≈ 4; DMA transfers are scheduled between stages, and the pipeline has a prologue to fill and an epilogue to drain]
-
- 6 -
SGMS Phases
1) Fission + processor assignment (load balance)
2) Stage assignment (causality, DMA overlap)
3) Code generation
-
- 7 -
Processor Assignment: Maximizing Throughput
Assign each filter to a PE so as to maximize throughput, i.e., minimize the initiation interval II. Let $a_{ij} = 1$ if filter $i$ is assigned to PE $j$, and let $w(i)$ be the work of filter $i$:

Minimize $II$ subject to
$\sum_{j=1}^{P} a_{ij} = 1$  for all filters $i = 1, \ldots, N$
$\sum_{i=1}^{N} w(i)\, a_{ij} \le II$  for all PEs $j = 1, \ldots, P$
[Figure: Example: a stream graph of six filters A through F with workloads W of 20, 20, 20, 30, 50, 30 (total T1 = 170), mapped onto four PEs (PE0 through PE3). A balanced assignment achieves the minimum II of 50, i.e., maximum throughput: speedup T1/T2 = 170/50 = 3.4]
• Assigns each filter to a processor
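As a concrete reading of the formulation (a minimal sketch in plain C; the names are hypothetical, and the real system solves an ILP over all assignments rather than evaluating one), II for a given mapping is simply the load of the most heavily loaded PE:

#include <stdio.h>

#define NF 6  /* filters A..F */
#define NP 4  /* processing elements */

/* II for one assignment: the most-loaded PE bounds the throughput */
int initiation_interval(const int w[NF], const int pe_of[NF]) {
    int load[NP] = {0}, ii = 0;
    for (int i = 0; i < NF; i++)
        load[pe_of[i]] += w[i];
    for (int j = 0; j < NP; j++)
        if (load[j] > ii) ii = load[j];
    return ii;
}

int main(void) {
    int w[NF]     = {20, 20, 20, 30, 50, 30};  /* workloads of A..F */
    int pe_of[NF] = {0, 1, 2, 0, 3, 1};        /* one balanced mapping */
    printf("II = %d\n", initiation_interval(w, pe_of));  /* prints II = 50 */
    return 0;
}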
-
- 8 -
Need More Than Just Processor Assignment
Assign filters to processors
» Goal: equal work distribution
Graph partitioning? Bin packing?
[Figure: Original stream program: a split-join A → (B, C) → D with workloads A: 5, B: 40, C: 10, D: 5 (total 60). On two PEs the best partition is limited by B: speedup = 60/40 = 1.5. Fissing B into B1 and B2 (adding split S and join J) gives a modified stream program with speedup = 60/32 ≈ 2]
-
- 9 -
Filter Fission Choices
[Figure: a filter fissed across PE0 through PE3: speedup ≈ 4?]
-
- 10 -
Integrated Fission + PE Assign
Exact solution based on Integer Linear Programming (ILP)
…
» Split/join overhead factored in
Objective function: the maximal load on any PE
» Minimize it
Result
» Number of times to “split” each filter
» Filter → processor mapping
-
- 11 -
Step 2: Forming the Software Pipeline
To achieve speedup:
» All chunks should execute concurrently
» Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: a producer-consumer pair A → B split across PE0/PE1. Running Bi directly alongside the next iteration of A is invalid (✗): the A→B transfer must finish first. Giving the A→B DMA its own stage pipelines the schedule so that Ai+2, the DMA for Ai+1, and Bi execute concurrently: overlap Ai+2 with Bi]
-
- 12 -
Stage Assignment
Two rules assign a stage Si to each filter i:
» Preserve causality (producer-consumer dependence): if producer i and consumer j run on the same PE, then Sj ≥ Si
» Communication-computation overlap: if producer i (PE 1) feeds consumer j (PE 2), the DMA gets its own stage, with SDMA > Si and Sj = SDMA + 1
Data flow traversal of the stream graph
» Assign stages using the above two rules (see the sketch below)
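A minimal sketch of that traversal (plain C; the graph representation is hypothetical, not Streamroller’s): walk the edges in topological order and relax each consumer’s stage by whichever rule applies.

/* filters are numbered so every edge goes from a lower to a higher index,
   and edges are sorted by source in that (topological) order */
typedef struct { int src, dst; } Edge;

void assign_stages(int nf, const Edge *e, int ne,
                   const int pe_of[], int stage[]) {
    for (int i = 0; i < nf; i++) stage[i] = 0;
    for (int k = 0; k < ne; k++) {
        int i = e[k].src, j = e[k].dst;
        if (pe_of[i] == pe_of[j]) {
            /* same PE: causality only, Sj >= Si */
            if (stage[j] < stage[i]) stage[j] = stage[i];
        } else {
            /* cross-PE: the DMA takes stage Si + 1, consumer runs at SDMA + 1 */
            if (stage[j] < stage[i] + 2) stage[j] = stage[i] + 2;
        }
    }
}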
-
- 13 -
Stage Assignment Example
[Figure: the fissed example mapped to PE0/PE1: Stage 0: A, S, B1; Stage 1: DMAs; Stage 2: B2, C, J; Stage 3: DMA; Stage 4: D]
-
- 14 -
Step 3: Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs
One thread / SPE
Challenge
» Making a collection of independent threads implement a software pipeline
» Adapting the kernel-only code schema of a modulo schedule
-
- 15 -
Complete Example
void spe1_work()
{
    char stage[5] = {0};
    int i;
    stage[0] = 1;
    for (i = 0; i < iters + 4; i++) {
        /* run the stages whose predicate is set, then
           shift the predicates (full sketch below) */
    }
}
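A fuller sketch of that schema (the run_*_filters(), issue_dmas(), and wait_for_dmas_and_barrier() helpers are hypothetical; it assumes the five-stage assignment from the earlier example). Every stage’s work is guarded by a predicate, and shifting the predicates each iteration makes the prologue and epilogue fall out of the same loop body:

void spe1_work(int iters)
{
    char stage[5] = {0};
    int i, j;
    stage[0] = 1;
    for (i = 0; i < iters + 4; i++) {   /* 4 extra trips drain the 5-stage pipe */
        if (stage[4]) run_stage4_filters();      /* e.g., D */
        if (stage[2]) run_stage2_filters();      /* e.g., B2, C, J */
        if (stage[0]) run_stage0_filters();      /* e.g., A, S, B1 */
        if (stage[3] || stage[1]) issue_dmas();  /* push results to consumers */
        wait_for_dmas_and_barrier();             /* sync with the other SPEs */
        for (j = 4; j > 0; j--)                  /* shift the stage predicates */
            stage[j] = stage[j - 1];
        if (i >= iters - 1) stage[0] = 0;        /* stop starting new iterations */
    }
}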
-
- 16 -
SGMS (ILP) vs. Greedy
[Figure: Relative speedup on the bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, and radar benchmarks, comparing ILP partitioning, greedy partitioning, and exposed DMA (MIT method, ASPLOS ’06)]
• Solver time < 30 seconds for 16 processors
-
- 17 -
SGMS Conclusions
Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining
Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)
Scheduling framework
» Trade off memory space vs. load balance: memory-constrained (embedded) systems vs. cache-based systems
-
- 18 -
Discussion Points
Is it possible to convert stateful filters into stateless ones?
What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?
Could this be adapted for a more conventional multiprocessor with caches?
Can C code be automatically streamized?
Now you have seen 3 forms of software pipelining:
» 1) Instruction-level modulo scheduling, 2) decoupled software pipelining, 3) stream graph modulo scheduling
» Where else can it be used?
-
Compilation for GPUs
-
- 20 -
Why GPUs?
[Figure: Theoretical GFLOP/s, 2002 to 2011: NVIDIA GPUs (GeForce FX 5800, 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) pull steadily ahead of Intel CPUs]
-
- 21 -
Efficiency of GPUs
» High memory bandwidth: GTX 285: 159 GB/s vs. i7: 32 GB/s
» High flop rate: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
» High double-precision flop rate: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS
» High flops per watt: GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W
» High flops per dollar: GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$
-
- 22 -
GPU Architecture
[Figure: GPU architecture: 30 streaming multiprocessors (SM 0 through SM 29), each with 8 scalar cores plus shared memory and registers, linked by an interconnection network to global (device) memory; a PCIe bridge connects the GPU to the CPU and host memory]
-
- 23 -
CUDA
“Compute Unified Device Architecture”
General purpose programming model
» User kicks off batches of threads on the GPU
Advantages of CUDA
» Interface designed for compute: graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
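For illustration (a minimal sketch, not from the slides; the kernel name and sizes are made up), the host kicks off a batch of threads by launching a kernel over a grid of blocks and manages device memory explicitly:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float k, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (idx < n) data[idx] *= k;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));  /* explicit GPU memory management */
    /* ... cudaMemcpy input into d_data, omitted ... */
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  /* grid of 256-thread blocks */
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}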
-
- 24 -
Programming Model
[Figure: Over time, the host launches Kernel 1, which executes on the device as Grid 1, and then Kernel 2, which executes as Grid 2]
-
- 25 -
GPU Scheduling
[Figure: the thread blocks of Grid 1 are distributed across the SMs (each with 8 cores, shared memory, and registers) for execution]
-
- 26 -
Warp Generation
[Figure: Blocks 0 through 3 are assigned to SM0; within a block, consecutive threads are grouped into 32-thread warps: ThreadIds 0-31 form Warp 0, ThreadIds 32-63 form Warp 1]
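A one-liner view of the same grouping (a sketch assuming the 32-thread warp size shown above):

__device__ void warp_position(int *warp_id, int *lane) {
    *warp_id = threadIdx.x / 32;  /* ThreadIds 0-31 -> warp 0, 32-63 -> warp 1, ... */
    *lane    = threadIdx.x % 32;  /* position within the warp */
}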
-
- 27 -
Memory Hierarchy
Per-thread registers: int RegisterVar
Per-thread local memory: int LocalVarArray[10]
Per-block shared memory: __shared__ int SharedVar
Per-app global memory: __device__ int GlobalVar
Per-app constant memory: __constant__ int ConstVar
Per-app texture memory: Texture TextureVar
Global, constant, and texture memory are visible to the host as well as the device.
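A minimal kernel sketch (hypothetical, not from the lecture) that touches the per-thread, per-block, and per-app levels; it assumes 256-thread blocks and an element count that is a multiple of 256:

__constant__ float bias;            /* per-app constant memory */

__global__ void block_shift(float *g)  /* g lives in per-app global memory */
{
    __shared__ float tile[256];     /* per-block shared memory (256 = blockDim.x) */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* per-thread register */
    tile[threadIdx.x] = g[idx];     /* global -> shared */
    __syncthreads();                /* whole block sees the tile */
    int nb = (threadIdx.x + 1) % blockDim.x;  /* neighbor within the block */
    g[idx] = tile[nb] + bias;       /* shared -> global, plus a constant */
}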