
Page 1: Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins

Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures

Using Modulo Scheduling

Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins

Presented By: Nikhil Bansal

Page 2:

Outline

Introduction
  coarse-grained reconfigurable architectures
  core problem: exploiting parallelism
  modulo scheduling problem
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work

Page 3:

Example of Coarse-Grained Architectures: MorphoSys

Topology of MorphoSys Architecture of a Reconfigurable Cell

Ming-Hau Lee et al., University of California, Irvine

Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver ...

Page 4:

Core Problem: Exploiting Parallelism

Which kind of parallelism makes a difference?

Instruction-level parallelism: limited parallelism (constrained by dependences); VLIW does a good job.

Task (thread)-level parallelism: hard to automate; lacks support in coarse-grained architectures.

Loop-level parallelism (pipelining): fits coarse-grained architectures; higher parallelism than ILP.

Page 5:

Pipelining Using Modulo Scheduling

Modulo Scheduling (general): a way of pipelining. Iterations are overlapped; each iteration is initiated at a fixed interval (II).

For coarse-grained architectures: Where to place an operation? (placement) When to schedule an operation? (scheduling) How to connect operations? (routing) Plus the modulo constraints.
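As a concrete illustration of the modulo constraints (a sketch, not the DRESC code; the function and variable names are invented):

```python
# Sketch of the modulo constraint: a functional unit used at time t is
# busy in every iteration at slot t mod II, so two operations placed on
# the same FU must not collide modulo II.
def violates_modulo_constraint(placements, ii):
    """placements: list of (fu, time) pairs for scheduled operations."""
    used = set()
    for fu, t in placements:
        slot = (fu, t % ii)
        if slot in used:
            return True          # same FU claimed twice in one modulo slot
        used.add(slot)
    return False

# fu1 at t=0 and fu1 at t=2 collide when II = 2 (both map to slot 0),
# but are fine when II = 3.
assert violates_modulo_constraint([("fu1", 0), ("fu1", 2)], ii=2)
assert not violates_modulo_constraint([("fu1", 0), ("fu1", 2)], ii=3)
```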

Page 6:

Modulo Scheduling Problem (cont.)

[Figure: a) an example: a dataflow graph of four operations n1, n2, n3, n4 mapped onto a 2x2 matrix of functional units fu1-fu4; b) space-time representation over t = 0..4, showing the prologue, the steady state (kernel), and the epilogue of the pipeline.]

II = 1

Pipeline stages = 3

4 operations/cycle in the kernel

Page 7:

Outline

Introduction
Compiler Framework
  structure of the compiler
  architecture description and abstraction
Modulo Scheduling Algorithm
Conclusions and Future Work

Page 8:

The Structure of DRESC Compiler

C program → IMPACT Frontend (external tool) → Lcode IR → Dataflow Analysis & Transformation → Modulo Scheduling Algorithm → Simulator (under development)

Architecture Description → Architecture Parser → Architecture Abstraction → Modulo Scheduling Algorithm

DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler.

Page 9:

The Target Architecture Template

[Figure: example of an FU and register file, plus examples of topology. The FU takes pred, src1, and src2 inputs through multiplexers (muxa, muxb, muxc) and produces dst1, pred_dst1, and pred_dst2 outputs; results pass through a register into a register file (RF) with in, out1, and out2 ports. Control comes from a configuration RAM.]

Generalizes common features of other architectures. Uses an XML-based language to specify topology, resource allocation, operations, and timing.
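The description language itself is not shown in the slides; a purely hypothetical fragment in the spirit of such an XML template (all element and attribute names invented here) might look like:

```xml
<!-- Hypothetical sketch only: element and attribute names are invented. -->
<architecture rows="8" cols="8">
  <cell id="fu" ops="add sub mul" latency="1">
    <registerfile size="8" ports="2r1w"/>
  </cell>
  <topology>
    <connect from="fu[r][c]" to="fu[r][c+1]"/>  <!-- row neighbour -->
    <connect from="fu[r][c]" to="fu[r+1][c]"/>  <!-- column neighbour -->
  </topology>
</architecture>
```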

Page 10:

Architecture Description and Abstraction

XML-based architecture description → Architecture Parser → Architecture Abstraction → MRRG representation

The Modulo Routing Resource Graph (MRRG) abstracts the architecture for modulo scheduling. It combines features of:

• the modulo reservation table (MRT) from VLIW compilation

• the routing resource graph from FPGA P&R

It specifies resource allocation, operation binding, topology, and timing.

Page 11:

Definitions of MRRG

The MRRG is defined as a 3-tuple: G = {V, E, II}

v = (r, t), where r refers to a resource and t to a time stamp
E = {(vm, vn) | t(vm) <= t(vn)}
II = initiation interval

Important properties:
• modulo: if node (r, tj) is used, all nodes {(r, tk) | tj mod II = tk mod II} are used too
• asymmetric: no route from vi to vj if t(vi) > t(vj)

The modulo scheduling problem is transformed into a placement and routing (P&R) problem on the MRRG.
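The two properties can be made concrete in a small sketch (an invented representation for illustration, not the DRESC data structures):

```python
# Minimal MRRG sketch. Nodes are (resource, time) pairs; occupying one
# node occupies every node with the same resource and equal time mod II
# (the "modulo" property), and edges may only go forward in time
# (the "asymmetric" property).
class MRRG:
    def __init__(self, ii):
        self.ii = ii
        self.occupied = set()        # {(resource, t mod II)}

    def occupy(self, r, t):
        self.occupied.add((r, t % self.ii))

    def is_free(self, r, t):
        return (r, t % self.ii) not in self.occupied

    def edge_allowed(self, vm, vn):
        # asymmetric: no route from vm to vn if t(vm) > t(vn)
        return vm[1] <= vn[1]

g = MRRG(ii=2)
g.occupy("fu1", 0)
assert not g.is_free("fu1", 4)   # 4 mod 2 == 0: aliased with t = 0
assert g.is_free("fu1", 1)
assert g.edge_allowed(("fu1", 0), ("fu2", 1))
assert not g.edge_allowed(("fu2", 3), ("fu1", 1))
```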

Page 12:

Transform Components to MRRG

Register allocation becomes part of the P&R problem and is implicitly solved by the modulo scheduling algorithm. Register modeling is based on Roos2001.

[Figure: transformation of components into MRRG nodes. The FU (inputs pred, src1, src2; outputs dst, pred_dst1, pred_dst2) becomes source and sink nodes; the register file (RF, with in, out1, out2 ports) is unrolled across cycles (cycle1, cycle2, ...) with capacity (cap) nodes modeling its registers.]

Page 13:

Outline

Introduction
Compiler Framework
Modulo Scheduling Algorithm
  combined placement and routing
  congestion negotiation
  simulated annealing
  results and related work
Conclusions and Future Work

Page 14:

Combined Placement and Routing

A space-time routing resource graph can't guarantee routability during placement.

[Figure: the flow for normal FPGA P&R: initial placement, then routing; if routing fails, an operation is ripped up, re-placed, and routing is retried. Illustrated with two LUTs connected through a switch block.]

Page 15:

Proposed Algorithm

[Figure: flow of the proposed algorithm: initialize P&R, the penalty, and the temperature; rip up an operation; re-place and re-route it; evaluate the new P&R; accept it or restore the operation; update the penalty and temperature; repeat until routing succeeds.]

1. Sort the operations.

2. For each II, first generate an initial schedule that respects dependence constraints only.

3. The algorithm iteratively reduces resource overuse and tries to arrive at a legal schedule:

• At every iteration, an operation is ripped up from the existing schedule and placed randomly.

• Connected nets are rerouted accordingly.

• A cost function (next slide) is computed to evaluate the new placement and routing.

• A simulated annealing strategy is used to decide whether the new placement is accepted.
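The acceptance step above is standard simulated annealing; a generic sketch (not the authors' implementation, parameter values invented):

```python
import math
import random

def accept_move(old_cost, new_cost, temperature, rng=random.random):
    """Standard simulated-annealing acceptance rule: always accept an
    improvement; accept a worsening move with probability
    exp(-delta / T), so a high temperature tolerates bad moves."""
    delta = new_cost - old_cost
    if delta <= 0:
        return True
    return rng() < math.exp(-delta / temperature)

assert accept_move(10.0, 8.0, temperature=1.0)               # improvement
assert not accept_move(10.0, 50.0, 0.001, rng=lambda: 0.5)   # cold: reject
```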

Page 16:

Cost Function

Resources are allowed to be overused during P&R.

The cost of using one node is computed as follows:


c = base × occ + (occ − cap) × p

base: base cost of the node in the MRRG

occ: occupancy

cap: capacity of the node

p: penalty factor

The penalty is increased over time as follows:

p = p × mult_factor
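Read literally, the per-node cost combines the base cost, the occupancy, the capacity, and the penalty factor. A hedged sketch (whether overuse is clamped at zero, as done here, is an assumption):

```python
def node_cost(base, occ, cap, p):
    """Sketch of the congestion-negotiation cost of using one MRRG node:
    base cost scaled by occupancy, plus a penalty proportional to how
    far occupancy exceeds capacity (clamped at zero when not overused)."""
    return base * occ + p * max(occ - cap, 0)

def update_penalty(p, mult_factor):
    """The penalty grows geometrically over iterations, making resource
    overuse progressively more expensive."""
    return p * mult_factor

assert node_cost(base=1.0, occ=1, cap=1, p=10.0) == 1.0    # no overuse
assert node_cost(base=1.0, occ=3, cap=1, p=10.0) == 23.0   # 3 + 10 * 2
assert update_penalty(1.0, 1.5) == 1.5
```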

Page 17:

Parameters to Tune the Algorithm

Ordering of operations: techniques from Llosa2001.

Relaxing factor of the schedule length: difficulty of moving operations vs. more pipeline stages.

Parameters of the SA algorithm.

Costs associated with different resources: register files get a lower base cost.

Penalty factor associated with overused resources: a compromise between scheduling quality and speed.

...

Page 18:

Scheduling Results

kernel    no. of ops   MII   II   IPC    sched. density   time (sec.)
idct      86           2     3    28.7   44.8%            239
fft       70           3     3    23.3   36.4%            1995
corr      56           1     2    28     43.8%            264
latanal   12           1     1    12     18.8%            6.5

Scheduling results on an 8x8 matrix resembling the topology of MorphoSys.

Algorithm limitations: scheduling speed is relatively slow; scheduling quality still has room to improve; can't handle pipelined FUs; can only handle the inner loop of a loop nest.

Page 19:

Related Work

Modulo scheduling on clustered VLIWs: the problem is simpler in nature (no routing).

RaPiD, Garp: row-based architectures and scheduling techniques; no multiplexing.

PipeRench: the ring-like architecture is very specific, and the scheduling techniques are not general.

Z. Huang, S. Malik, DAC 2002: either uses a full crossbar or generates a dedicated datapath for several loops for pipelining.

Page 20:

Outline

Introduction
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work

Page 21:

Conclusions and Future Work

Conclusions:

Coarse-grained architectures have distinct features; compilers for them are possible and needed.

Loop-level parallelism is the right kind for coarse-grained reconfigurable architectures.

A novel modulo scheduling algorithm and an abstract architecture representation have been developed.

Future Work:

Improve the quality and speed of the scheduling algorithm.

Enlarge the scope of pipelineable loops.

Develop techniques to reduce the bottlenecks of pipelineable loops, e.g., taking distributed memory into account.