Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures
Using Modulo Scheduling
Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins
Presented By: Nikhil Bansal
Outline
Introduction
 - coarse-grained reconfigurable architectures
 - core problem: exploiting parallelism
 - modulo scheduling problem
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work
Example of Coarse-Grained Architectures: MorphoSys
[Figures: topology of MorphoSys; architecture of a reconfigurable cell]
Ming-Hau Lee et al., University of California, Irvine
Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver ...
Core Problem: Exploiting Parallelism
Which parallelism makes a difference?
Instruction-level parallelism: limited parallelism (constrained by dependences); VLIW does a good job
Task (thread)-level parallelism: hard to automate; lacks support in coarse-grained architectures
Loop-level parallelism (pipelining): fits coarse-grained architectures; higher parallelism than ILP
Pipelining Using Modulo Scheduling
Modulo Scheduling (general): a way of pipelining
 - iterations are overlapped
 - each iteration is initiated at a fixed interval (II)
For coarse-grained architectures:
 - where to place an operation? (placement)
 - when to schedule an operation? (scheduling)
 - how to connect operations? (routing)
 - modulo constraints
Modulo Scheduling Problem (cont.)
[Figure: a) an example — a dataflow graph (n1 -> n2, n1 -> n3, n2 -> n4, n3 -> n4) mapped onto a 2x2 matrix of FUs (fu1-fu4); b) space-time representation over t = 0..4, showing prologue, steady state (kernel), and epilogue. II = 1, pipeline stages = 3, 4 operations/cycle in the kernel.]
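The space-time behavior of this example can be sketched in a few lines of Python (a hypothetical illustration, not the authors' tooling): with II = 1 and the schedule offsets from the figure, the steady-state kernel cycle executes operations from three overlapped iterations at once.

```python
II = 1                                          # initiation interval
offset = {"n1": 0, "n2": 1, "n3": 1, "n4": 2}   # schedule time within one iteration

def start_time(op, iteration):
    # a new iteration is initiated every II cycles
    return iteration * II + offset[op]

def ops_in_cycle(t, n_iterations):
    # all (operation, iteration) pairs that execute in cycle t
    return [(op, i) for i in range(n_iterations)
            for op in offset if start_time(op, i) == t]

# In the steady state (cycle t = 2, three iterations in flight) the
# kernel issues 4 operations per cycle: n4 of iteration 0, n2 and n3
# of iteration 1, and n1 of iteration 2.
print(ops_in_cycle(2, 3))
```

With a smaller II more iterations overlap, so the kernel packs more operations into each cycle — which is exactly why minimizing II is the scheduling objective.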
Outline
Introduction
Compiler Framework
 - structure of the compiler
 - architecture description and abstraction
Modulo Scheduling Algorithm
Conclusions and Future Work
The Structure of DRESC Compiler
[Diagram: C program -> IMPACT Frontend (external tool) -> Lcode IR -> Dataflow Analysis & Transformation -> Modulo Scheduling Algorithm; an XML Architecture Description -> Architecture Parser -> Architecture Abstraction feeds the scheduler; the simulator is under development.]
DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler.
The Target Architecture Template
[Figures: example of an FU and register file — the FU has pred/src1/src2 inputs fed through muxa/muxb/muxc, pred_dst1/pred_dst2/dst1 outputs, an output register, and configuration RAM; the RF has an in port and out1/out2 ports. Also: examples of topology.]
Generalizing common features of other architectures
Using an XML-based language to specify topology, resource allocation, operations and timing
Architecture Description and Abstraction
XML-based architecture description -> Architecture Parser -> Architecture Abstraction -> MRRG representation
Modulo Routing Resource Graph (MRRG) abstracts architecture for modulo scheduling. It combines features of:
•Modulo reservation table (MRT) from VLIW compilation
•Routing resource graph from FPGA P&R
It specifies resource allocation, operation binding, topology and timing.
Definitions of MRRG
MRRG is defined as a 3-tuple: G = {V, E, II}
 - v = (r, t): r refers to a resource, t to a time stamp
 - E = {(vm, vn) | t(vm) <= t(vn)}
 - II = initiation interval
Important properties:
 - modulo: if node (r, tj) is used, all the nodes {(r, tk) | tj mod II = tk mod II} are used too
 - asymmetric: there is no route from vi to vj if t(vi) > t(vj)
The modulo scheduling problem is thus transformed into a placement and routing (P&R) problem on the MRRG.
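The modulo property can be illustrated with a short sketch (hypothetical code with a simple set-based reservation table, not part of DRESC): occupying one resource node occupies every time slot congruent to it modulo II.

```python
def reserve(used, r, t, II, horizon):
    """Reserving resource r at time t also occupies every congruent
    time slot: (r, tk) for all tk with tk mod II == t mod II."""
    for tk in range(horizon):
        if tk % II == t % II:
            used.add((r, tk))
    return used

used = reserve(set(), "fu1", 1, II=2, horizon=6)
print(sorted(used))   # fu1 is occupied at times 1, 3 and 5
```

This is what enforces the modulo constraint during P&R: a placement that looks free at one cycle may still conflict with another iteration of the same loop.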
Transform Components to MRRG
Register allocation is transformed into part of the P&R problem and solved implicitly by the modulo scheduling algorithm.
Register modeling is based on Roos2001.
[Figures: MRRG models of the components — the FU becomes source nodes (pred, src1, src2) and sink nodes (pred_dst1, pred_dst2, dst); the RF becomes in/out1/out2 port nodes with internal capacity (cap) nodes spanning cycle 1 and cycle 2.]
Outline
Introduction
Compiler Framework
Modulo Scheduling Algorithm
 - combined placement and routing
 - congestion negotiation
 - simulated annealing
 - results and related work
Conclusions and Future Work
Combined Placement and Routing
A space-time routing resource graph can't guarantee routability during placement.
[Flowchart: Init Placement & Routing -> Rip-Up op -> Re-placement -> Routing -> Success? (No: loop back; Yes: done).]
[Figure: for normal FPGA P&R, operations n1 and n2 are placed on LUT1 and LUT2 and connected through a switch block.]
Proposed Algorithm
[Flowchart: Init P&R & Penalty -> Init Temperature -> Rip-Up op -> Re-P&R op -> Evaluate New P&R -> Accept? (No: Restore op) -> Success? (Yes: done; No: Update Penalty, Update Temperature, loop).]
1. Sort the operations.
2. For each II, first generate an initial schedule which respects dependency constraints only.
3. The algorithm iteratively reduces resource overuse and tries to come up with a legal schedule:
 - at every iteration, an operation is ripped up from the existing schedule and placed randomly
 - connected nets are rerouted accordingly
 - a cost function (next slide) is computed to evaluate the new placement and routing
 - a simulated annealing strategy is used to decide whether to accept the new placement or not
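The acceptance step can be sketched with the standard Metropolis criterion (a generic sketch; the slide does not spell out the exact acceptance function used):

```python
import math
import random

def accept(new_cost, old_cost, temperature):
    # always accept an improvement; accept a worse placement with
    # probability exp(-delta / T), which shrinks as T cools down
    delta = new_cost - old_cost
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)
```

At high temperature almost any move is accepted, letting the scheduler escape local minima; as the temperature is lowered, only improving moves survive.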
Cost Function
Resources are allowed to be overused during P&R.
The cost of using one node is computed as follows:
c = base * occ + (occ - cap) * p

base: base cost of the node in the MRRG
occ: occupancy
cap: capacity of the node
p: penalty factor
The penalty is increased over time as follows:

p = p * mult_factor
Parameters to Tune the Algorithm
 - Ordering of operations: techniques from Llosa2001
 - Relaxing factor of schedule length: difficulty of moving operations vs. more pipeline stages
 - Parameters of the SA algorithm
 - Costs associated with different resources: register files get a lower base cost
 - Penalty factor associated with overused resources: compromise between scheduling quality and speed
 - ...
Scheduling Results
kernel    no. of ops   MII   II   IPC    sched. density   time (sec.)
idct      86           2     3    28.7   44.8%            239
fft       70           3     3    23.3   36.4%            1995
corr      56           1     2    28     43.8%            264
latanal   12           1     1    12     18.8%            6.5

Scheduling results on an 8x8 matrix resembling the topology of MorphoSys.
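A quick sanity check of the table (a hypothetical script, not from the paper): IPC is the number of operations divided by II, and scheduling density is IPC over the 64 FUs of the 8x8 matrix.

```python
# kernel -> (no. of ops, II), taken from the table above
kernels = {"idct": (86, 3), "fft": (70, 3), "corr": (56, 2), "latanal": (12, 1)}

for name, (ops, II) in kernels.items():
    ipc = ops / II
    density = ipc / 64          # 8x8 matrix = 64 FUs
    print(f"{name}: IPC = {ipc:.1f}, density = {density * 100:.1f}%")
```

The computed values reproduce the table's IPC and density columns up to rounding.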
Algorithm limitations:
 - scheduling speed is relatively slow
 - scheduling quality still has room to improve
 - can't handle pipelined FUs
 - can only handle the inner loop of a loop nest
Related Work
Modulo scheduling on clustered VLIWs: the problem is simpler in nature (no routing).
RaPiD, Garp: row-based architectures and scheduling techniques; no multiplexing.
PipeRench: the ring-like architecture is very specific, and the scheduling techniques are not general.
Z. Huang, S. Malik, DAC 2002: either use a full crossbar, or generate a dedicated datapath for several loops for pipelining.
Outline
Introduction
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work
Conclusions and Future Work
Conclusions:
 - Coarse-grained architectures have distinct features; compilers are possible and needed.
 - Loop-level parallelism is the right one for coarse-grained reconfigurable architectures.
 - A novel modulo scheduling algorithm and an abstract architecture representation are developed.

Future work:
 - Improve quality and speed of the scheduling algorithm.
 - Enlarge the scope of pipelineable loops.
 - Develop techniques to reduce the bottleneck of pipelineable loops, e.g., taking distributed memory into account.