
Page 1: Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins

Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures

Using Modulo Scheduling

Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins

Presented By: Nikhil Bansal

Page 2:

Outline

Introduction
  coarse-grained reconfigurable architectures
  core problem: exploiting parallelism
  modulo scheduling problem
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work

Page 3:

Example of Coarse-Grained Architectures: MorphoSys

Topology of MorphoSys Architecture of a Reconfigurable Cell

Ming-Hau Lee et al., University of California, Irvine

Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver ...

Page 4:

Core Problem: Exploiting Parallelism

Which kind of parallelism makes a difference?

Instruction-level parallelism: limited parallelism (constrained by dependences); VLIW does a good job.

Task (thread)-level parallelism: hard to automate; lacks support in coarse-grained architectures.

Loop-level parallelism (pipelining): fits coarse-grained architectures; higher parallelism than ILP.

Page 5:

Pipelining Using Modulo Scheduling

Modulo Scheduling (general): a way of pipelining. Iterations are overlapped; each iteration is initiated at a fixed interval (II).

For coarse-grained architectures: Where to place an operation? (placement) When to schedule an operation? (scheduling) How to connect operations? (routing) Plus the modulo constraints.
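As a concrete illustration of the modulo constraints (a sketch, not the DRESC code; the function and variable names are invented):

```python
# Sketch of the modulo constraint: a functional unit used at time t is
# busy in every iteration at slot t mod II, so two operations placed on
# the same FU must not collide modulo II.
def violates_modulo_constraint(placements, ii):
    """placements: list of (fu, time) pairs for scheduled operations."""
    used = set()
    for fu, t in placements:
        slot = (fu, t % ii)
        if slot in used:
            return True          # same FU claimed twice in one modulo slot
        used.add(slot)
    return False

# fu1 at t=0 and fu1 at t=2 collide when II = 2 (both map to slot 0),
# but are fine when II = 3.
assert violates_modulo_constraint([("fu1", 0), ("fu1", 2)], ii=2)
assert not violates_modulo_constraint([("fu1", 0), ("fu1", 2)], ii=3)
```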

Page 6:

Modulo Scheduling Problem (cont.)

[Figure: a) an example: a dataflow graph of four operations n1, n2, n3, n4 mapped onto a 2x2 matrix of functional units fu1-fu4; b) space-time representation over t = 0..4, showing the prologue, the steady state (kernel), and the epilogue of the pipeline.]

II = 1

Pipeline stages = 3

4 operations/cycle in the kernel

Page 7:

Outline

Introduction
Compiler Framework
  structure of the compiler
  architecture description and abstraction
Modulo Scheduling Algorithm
Conclusions and Future Work

Page 8:

The Structure of DRESC Compiler

C program → IMPACT Frontend (external tool) → Lcode IR → Dataflow Analysis & Transformation → Modulo Scheduling Algorithm → Simulator (under development)

Architecture Description → Architecture Parser → Architecture Abstraction → Modulo Scheduling Algorithm

DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler.

Page 9:

The Target Architecture Template

[Figure: example of an FU and register file, plus examples of topology. The FU takes pred, src1, and src2 inputs through multiplexers (muxa, muxb, muxc) and produces dst1, pred_dst1, and pred_dst2 outputs; results pass through a register into a register file (RF) with in, out1, and out2 ports. Control comes from a configuration RAM.]

Generalizes common features of other architectures. Uses an XML-based language to specify topology, resource allocation, operations, and timing.
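The description language itself is not shown in the slides; a purely hypothetical fragment in the spirit of such an XML template (all element and attribute names invented here) might look like:

```xml
<!-- Hypothetical sketch only: element and attribute names are invented. -->
<architecture rows="8" cols="8">
  <cell id="fu" ops="add sub mul" latency="1">
    <registerfile size="8" ports="2r1w"/>
  </cell>
  <topology>
    <connect from="fu[r][c]" to="fu[r][c+1]"/>  <!-- row neighbour -->
    <connect from="fu[r][c]" to="fu[r+1][c]"/>  <!-- column neighbour -->
  </topology>
</architecture>
```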

Page 10:

Architecture Description and Abstraction

XML-based architecture description → Architecture Parser → Architecture Abstraction → MRRG representation

The Modulo Routing Resource Graph (MRRG) abstracts the architecture for modulo scheduling. It combines features of:

• the modulo reservation table (MRT) from VLIW compilation

• the routing resource graph from FPGA P&R

It specifies resource allocation, operation binding, topology, and timing.

Page 11:

Definitions of MRRG

The MRRG is defined as a 3-tuple: G = {V, E, II}

v = (r, t), where r refers to a resource and t to a time stamp
E = {(vm, vn) | t(vm) <= t(vn)}
II = initiation interval

Important properties:
• modulo: if node (r, tj) is used, all nodes {(r, tk) | tj mod II = tk mod II} are used too
• asymmetric: no route from vi to vj if t(vi) > t(vj)

The modulo scheduling problem is transformed into a placement and routing (P&R) problem on the MRRG.
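The two properties can be made concrete in a small sketch (an invented representation for illustration, not the DRESC data structures):

```python
# Minimal MRRG sketch. Nodes are (resource, time) pairs; occupying one
# node occupies every node with the same resource and equal time mod II
# (the "modulo" property), and edges may only go forward in time
# (the "asymmetric" property).
class MRRG:
    def __init__(self, ii):
        self.ii = ii
        self.occupied = set()        # {(resource, t mod II)}

    def occupy(self, r, t):
        self.occupied.add((r, t % self.ii))

    def is_free(self, r, t):
        return (r, t % self.ii) not in self.occupied

    def edge_allowed(self, vm, vn):
        # asymmetric: no route from vm to vn if t(vm) > t(vn)
        return vm[1] <= vn[1]

g = MRRG(ii=2)
g.occupy("fu1", 0)
assert not g.is_free("fu1", 4)   # 4 mod 2 == 0: aliased with t = 0
assert g.is_free("fu1", 1)
assert g.edge_allowed(("fu1", 0), ("fu2", 1))
assert not g.edge_allowed(("fu2", 3), ("fu1", 1))
```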

Page 12:

Transform Components to MRRG

Register allocation becomes part of the P&R problem and is implicitly solved by the modulo scheduling algorithm. Register modeling is based on Roos2001.

[Figure: transformation of components into MRRG nodes. The FU (inputs pred, src1, src2; outputs dst, pred_dst1, pred_dst2) becomes source and sink nodes; the register file (RF, with in, out1, out2 ports) is unrolled across cycles (cycle1, cycle2, ...) with capacity (cap) nodes modeling its registers.]

Page 13:

Outline

Introduction
Compiler Framework
Modulo Scheduling Algorithm
  combined placement and routing
  congestion negotiation
  simulated annealing
  results and related work
Conclusions and Future Work

Page 14:

Combined Placement and Routing

A space-time routing resource graph can't guarantee routability during placement.

[Figure: the flow for normal FPGA P&R: initial placement, then routing; if routing fails, an operation is ripped up, re-placed, and routing is retried. Illustrated with two LUTs connected through a switch block.]

Page 15:

Proposed Algorithm

[Figure: flow of the proposed algorithm: initialize P&R, the penalty, and the temperature; rip up an operation; re-place and re-route it; evaluate the new P&R; accept it or restore the operation; update the penalty and temperature; repeat until routing succeeds.]

1. Sort the operations.

2. For each II, first generate an initial schedule that respects dependence constraints only.

3. The algorithm iteratively reduces resource overuse and tries to arrive at a legal schedule:

• At every iteration, an operation is ripped up from the existing schedule and placed randomly.

• Connected nets are rerouted accordingly.

• A cost function (next slide) is computed to evaluate the new placement and routing.

• A simulated annealing strategy is used to decide whether the new placement is accepted.
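The acceptance step above is standard simulated annealing; a generic sketch (not the authors' implementation, parameter values invented):

```python
import math
import random

def accept_move(old_cost, new_cost, temperature, rng=random.random):
    """Standard simulated-annealing acceptance rule: always accept an
    improvement; accept a worsening move with probability
    exp(-delta / T), so a high temperature tolerates bad moves."""
    delta = new_cost - old_cost
    if delta <= 0:
        return True
    return rng() < math.exp(-delta / temperature)

assert accept_move(10.0, 8.0, temperature=1.0)               # improvement
assert not accept_move(10.0, 50.0, 0.001, rng=lambda: 0.5)   # cold: reject
```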

Page 16:

Cost Function

Resources are allowed to be overused during P&R.

The cost of using one node is computed as follows:


c = base × occ + (occ − cap) × p

base: base cost of the node in the MRRG

occ: occupancy

cap: capacity of the node

p: penalty factor

The penalty is increased over time as follows:

p = p × mult_factor
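Read literally, the per-node cost combines the base cost, the occupancy, the capacity, and the penalty factor. A hedged sketch (whether overuse is clamped at zero, as done here, is an assumption):

```python
def node_cost(base, occ, cap, p):
    """Sketch of the congestion-negotiation cost of using one MRRG node:
    base cost scaled by occupancy, plus a penalty proportional to how
    far occupancy exceeds capacity (clamped at zero when not overused)."""
    return base * occ + p * max(occ - cap, 0)

def update_penalty(p, mult_factor):
    """The penalty grows geometrically over iterations, making resource
    overuse progressively more expensive."""
    return p * mult_factor

assert node_cost(base=1.0, occ=1, cap=1, p=10.0) == 1.0    # no overuse
assert node_cost(base=1.0, occ=3, cap=1, p=10.0) == 23.0   # 3 + 10 * 2
assert update_penalty(1.0, 1.5) == 1.5
```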

Page 17:

Parameters to Tune the Algorithm

Ordering of operations: techniques from Llosa2001.

Relaxing factor of the schedule length: difficulty of moving operations vs. more pipeline stages.

Parameters of the SA algorithm.

Costs associated with different resources: register files get a lower base cost.

Penalty factor associated with overused resources: a compromise between scheduling quality and speed.

...

Page 18:

Scheduling Results

kernel    no. of ops   MII   II   IPC    sched. density   time (sec.)
idct      86           2     3    28.7   44.8%            239
fft       70           3     3    23.3   36.4%            1995
corr      56           1     2    28     43.8%            264
latanal   12           1     1    12     18.8%            6.5

Scheduling results on an 8x8 matrix resembling the topology of MorphoSys.

Algorithm limitations: scheduling speed is relatively slow; scheduling quality still has room to improve; can't handle pipelined FUs; can only handle the inner loop of a loop nest.

Page 19:

Related Work

Modulo scheduling on clustered VLIWs: the problem is simpler in nature (no routing).

RaPiD, Garp: row-based architectures and scheduling techniques; no multiplexing.

PipeRench: the ring-like architecture is very specific, and the scheduling techniques are not general.

Z. Huang, S. Malik, DAC 2002: either uses a full crossbar or generates a dedicated datapath for several loops for pipelining.

Page 20:

Outline

Introduction
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work

Page 21:

Conclusions and Future Work

Conclusions:

Coarse-grained architectures have distinct features; compilers for them are possible and needed.

Loop-level parallelism is the right kind for coarse-grained reconfigurable architectures.

A novel modulo scheduling algorithm and an abstract architecture representation have been developed.

Future Work:

Improve the quality and speed of the scheduling algorithm.

Enlarge the scope of pipelineable loops.

Develop techniques to reduce the bottlenecks of pipelineable loops, e.g., taking distributed memory into account.