-
EECS 583 – Class 20
Research Topic 2: Stream
Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi
-
- 1 -
Announcements & Reading Material
Exams graded and will be returned in Wednesday’s class
This class
» “Orchestrating the Execution of Stream Programs on Multicore
Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN
2008 Conference on Programming Language Design and
Implementation, Jun. 2008.
Next class – Research Topic 3: Security
» “Dynamic taint analysis for automatic detection, analysis, and
signature generation of exploits on commodity software,” James
Newsome and Dawn Song, Proceedings of the Network and
Distributed System Security Symposium, Feb. 2005
-
Stream Graph Modulo Scheduling
-
- 3 -
Stream Graph Modulo Scheduling (SGMS)
Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs
Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
» DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell processor block diagram: eight SPEs (SPE0 through SPE7), each an SPU with a 256 KB local store and an MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM]
-
- 4 -
Preliminaries
Synchronous Data Flow (SDF) [Lee ’87]
StreamIt [Thies ’02]

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    pop();
    push(sum);
  }
}
-
- 5 -
SGMS Overview
[Figure: SGMS overview: a single-PE schedule of length T1 is software-pipelined across PE0 through PE3 into stages of length T4, with T1/T4 ≈ 4; DMA transfers are scheduled between stages, and the pipeline has a prologue to fill and an epilogue to drain]
-
- 6 -
SGMS Phases
1) Fission + processor assignment (load balance)
2) Stage assignment (causality, DMA overlap)
3) Code generation
-
- 7 -
Processor Assignment: Maximizing Throughput
Assign each filter to a PE so as to maximize throughput, i.e., minimize the initiation interval II. Let $a_{ij} = 1$ if filter $i$ is assigned to PE $j$, and let $w(i)$ be the work of filter $i$:

Minimize $II$ subject to
$\sum_{j=1}^{P} a_{ij} = 1$  for all filters $i = 1, \ldots, N$
$\sum_{i=1}^{N} w(i)\, a_{ij} \le II$  for all PEs $j = 1, \ldots, P$
[Figure: Example: a stream graph of six filters A through F with workloads W of 20, 20, 20, 30, 50, 30 (total T1 = 170), mapped onto four PEs (PE0 through PE3). A balanced assignment achieves the minimum II of 50, i.e., maximum throughput: speedup T1/T2 = 170/50 = 3.4]
• Assigns each filter to a processor
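As a concrete reading of the formulation (a minimal sketch in plain C; the names are hypothetical, and the real system solves an ILP over all assignments rather than evaluating one), II for a given mapping is simply the load of the most heavily loaded PE:

#include <stdio.h>

#define NF 6  /* filters A..F */
#define NP 4  /* processing elements */

/* II for one assignment: the most-loaded PE bounds the throughput */
int initiation_interval(const int w[NF], const int pe_of[NF]) {
    int load[NP] = {0}, ii = 0;
    for (int i = 0; i < NF; i++)
        load[pe_of[i]] += w[i];
    for (int j = 0; j < NP; j++)
        if (load[j] > ii) ii = load[j];
    return ii;
}

int main(void) {
    int w[NF]     = {20, 20, 20, 30, 50, 30};  /* workloads of A..F */
    int pe_of[NF] = {0, 1, 2, 0, 3, 1};        /* one balanced mapping */
    printf("II = %d\n", initiation_interval(w, pe_of));  /* prints II = 50 */
    return 0;
}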
-
- 8 -
Need More Than Just Processor Assignment
Assign filters to processors
» Goal: equal work distribution
Graph partitioning? Bin packing?
[Figure: Original stream program: a split-join A → (B, C) → D with workloads A: 5, B: 40, C: 10, D: 5 (total 60). On two PEs the best partition is limited by B: speedup = 60/40 = 1.5. Fissing B into B1 and B2 (adding split S and join J) gives a modified stream program with speedup = 60/32 ≈ 2]
-
- 9 -
Filter Fission Choices
[Figure: a filter fissed across PE0 through PE3: speedup ≈ 4?]
-
- 10 -
Integrated Fission + PE Assign
Exact solution based on Integer Linear Programming (ILP)
…
» Split/join overhead factored in
Objective function: the maximal load on any PE
» Minimize it
Result
» Number of times to “split” each filter
» Filter → processor mapping
-
- 11 -
Step 2: Forming the Software Pipeline
To achieve speedup:
» All chunks should execute concurrently
» Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: a producer-consumer pair A → B split across PE0/PE1. Running Bi directly alongside the next iteration of A is invalid (✗): the A→B transfer must finish first. Giving the A→B DMA its own stage pipelines the schedule so that Ai+2, the DMA for Ai+1, and Bi execute concurrently: overlap Ai+2 with Bi]
-
- 12 -
Stage Assignment
Two rules assign a stage Si to each filter i:
» Preserve causality (producer-consumer dependence): if producer i and consumer j run on the same PE, then Sj ≥ Si
» Communication-computation overlap: if producer i (PE 1) feeds consumer j (PE 2), the DMA gets its own stage, with SDMA > Si and Sj = SDMA + 1
Data flow traversal of the stream graph
» Assign stages using the above two rules (see the sketch below)
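A minimal sketch of that traversal (plain C; the graph representation is hypothetical, not Streamroller’s): walk the edges in topological order and relax each consumer’s stage by whichever rule applies.

/* filters are numbered so every edge goes from a lower to a higher index,
   and edges are sorted by source in that (topological) order */
typedef struct { int src, dst; } Edge;

void assign_stages(int nf, const Edge *e, int ne,
                   const int pe_of[], int stage[]) {
    for (int i = 0; i < nf; i++) stage[i] = 0;
    for (int k = 0; k < ne; k++) {
        int i = e[k].src, j = e[k].dst;
        if (pe_of[i] == pe_of[j]) {
            /* same PE: causality only, Sj >= Si */
            if (stage[j] < stage[i]) stage[j] = stage[i];
        } else {
            /* cross-PE: the DMA takes stage Si + 1, consumer runs at SDMA + 1 */
            if (stage[j] < stage[i] + 2) stage[j] = stage[i] + 2;
        }
    }
}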
-
- 13 -
Stage Assignment Example
[Figure: the fissed example mapped to PE0/PE1: Stage 0: A, S, B1; Stage 1: DMAs; Stage 2: B2, C, J; Stage 3: DMA; Stage 4: D]
-
- 14 -
Step 3: Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs
One thread / SPE
Challenge
» Making a collection of independent threads implement a software pipeline
» Adapting the kernel-only code schema of a modulo schedule
-
- 15 -
Complete Example
void spe1_work()
{
    char stage[5] = {0};
    int i;
    stage[0] = 1;
    for (i = 0; i < iters + 4; i++) {
        /* run the stages whose predicate is set, then
           shift the predicates (full sketch below) */
    }
}
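A fuller sketch of that schema (the run_*_filters(), issue_dmas(), and wait_for_dmas_and_barrier() helpers are hypothetical; it assumes the five-stage assignment from the earlier example). Every stage’s work is guarded by a predicate, and shifting the predicates each iteration makes the prologue and epilogue fall out of the same loop body:

void spe1_work(int iters)
{
    char stage[5] = {0};
    int i, j;
    stage[0] = 1;
    for (i = 0; i < iters + 4; i++) {   /* 4 extra trips drain the 5-stage pipe */
        if (stage[4]) run_stage4_filters();      /* e.g., D */
        if (stage[2]) run_stage2_filters();      /* e.g., B2, C, J */
        if (stage[0]) run_stage0_filters();      /* e.g., A, S, B1 */
        if (stage[3] || stage[1]) issue_dmas();  /* push results to consumers */
        wait_for_dmas_and_barrier();             /* sync with the other SPEs */
        for (j = 4; j > 0; j--)                  /* shift the stage predicates */
            stage[j] = stage[j - 1];
        if (i >= iters - 1) stage[0] = 0;        /* stop starting new iterations */
    }
}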
-
- 16 -
SGMS (ILP) vs. Greedy
[Figure: Relative speedup on the bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, and radar benchmarks, comparing ILP partitioning, greedy partitioning, and exposed DMA (MIT method, ASPLOS ’06)]
• Solver time < 30 seconds for 16 processors
-
- 17 -
SGMS Conclusions
Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining
Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)
Scheduling framework
» Trade off memory space vs. load balance: memory-constrained (embedded) systems vs. cache-based systems
-
- 18 -
Discussion Points
Is it possible to convert stateful filters into stateless ones?
What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?
Could this be adapted for a more conventional multiprocessor with caches?
Can C code be automatically streamized?
Now you have seen 3 forms of software pipelining:
» 1) Instruction-level modulo scheduling, 2) decoupled software pipelining, 3) stream graph modulo scheduling
» Where else can it be used?
-
Compilation for GPUs
-
- 20 -
Why GPUs?
[Figure: Theoretical GFLOP/s, 2002 to 2011: NVIDIA GPUs (GeForce FX 5800, 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) pull steadily ahead of Intel CPUs]
-
- 21 -
Efficiency of GPUs
» High memory bandwidth: GTX 285: 159 GB/s vs. i7: 32 GB/s
» High flop rate: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
» High double-precision flop rate: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS
» High flops per watt: GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W
» High flops per dollar: GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$
-
- 22 -
GPU Architecture
[Figure: GPU architecture: 30 streaming multiprocessors (SM 0 through SM 29), each with 8 scalar cores plus shared memory and registers, linked by an interconnection network to global (device) memory; a PCIe bridge connects the GPU to the CPU and host memory]
-
- 23 -
CUDA
“Compute Unified Device Architecture”
General purpose programming model
» User kicks off batches of threads on the GPU
Advantages of CUDA
» Interface designed for compute: graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
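For illustration (a minimal sketch, not from the slides; the kernel name and sizes are made up), the host kicks off a batch of threads by launching a kernel over a grid of blocks and manages device memory explicitly:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float k, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (idx < n) data[idx] *= k;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));  /* explicit GPU memory management */
    /* ... cudaMemcpy input into d_data, omitted ... */
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  /* grid of 256-thread blocks */
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}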
-
- 24 -
Programming Model
[Figure: Over time, the host launches Kernel 1, which executes on the device as Grid 1, and then Kernel 2, which executes as Grid 2]
-
- 25 -
GPU Scheduling
[Figure: the thread blocks of Grid 1 are distributed across the SMs (each with 8 cores, shared memory, and registers) for execution]
-
- 26 -
Warp Generation
[Figure: Blocks 0 through 3 are assigned to SM0; within a block, consecutive threads are grouped into 32-thread warps: ThreadIds 0-31 form Warp 0, ThreadIds 32-63 form Warp 1]
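A one-liner view of the same grouping (a sketch assuming the 32-thread warp size shown above):

__device__ void warp_position(int *warp_id, int *lane) {
    *warp_id = threadIdx.x / 32;  /* ThreadIds 0-31 -> warp 0, 32-63 -> warp 1, ... */
    *lane    = threadIdx.x % 32;  /* position within the warp */
}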
-
- 27 -
Memory Hierarchy
Per-thread registers: int RegisterVar
Per-thread local memory: int LocalVarArray[10]
Per-block shared memory: __shared__ int SharedVar
Per-app global memory: __device__ int GlobalVar
Per-app constant memory: __constant__ int ConstVar
Per-app texture memory: Texture TextureVar
Global, constant, and texture memory are visible to the host as well as the device.
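A minimal kernel sketch (hypothetical, not from the lecture) that touches the per-thread, per-block, and per-app levels; it assumes 256-thread blocks and an element count that is a multiple of 256:

__constant__ float bias;            /* per-app constant memory */

__global__ void block_shift(float *g)  /* g lives in per-app global memory */
{
    __shared__ float tile[256];     /* per-block shared memory (256 = blockDim.x) */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* per-thread register */
    tile[threadIdx.x] = g[idx];     /* global -> shared */
    __syncthreads();                /* whole block sees the tile */
    int nb = (threadIdx.x + 1) % blockDim.x;  /* neighbor within the block */
    g[idx] = tile[nb] + bias;       /* shared -> global, plus a constant */
}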