xPilot A Platform-Based Behavioral Synthesis System

xPilot: A Platform-Based Behavioral Synthesis System. Prof. Jason Cong. Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang. August, 2005. Supported by NSF, GSRC, Altera, Xilinx.


Transcript of xPilot A Platform-Based Behavioral Synthesis System

Page 1: xPilot   A Platform-Based Behavioral Synthesis System

xPilot: A Platform-Based Behavioral Synthesis System

Prof. Jason Cong

Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang

August, 2005

Supported by NSF, GSRC, Altera, Xilinx.

Page 2: xPilot   A Platform-Based Behavioral Synthesis System

Outline

Motivation
xPilot system framework
Overview of the synthesis engine
• Scheduling
• Resource binding
Experimental results

Page 3: xPilot   A Platform-Based Behavioral Synthesis System

Motivation (1)

Design complexity is outgrowing the traditional RTL method
• It is feasible to build an SoC device with 500M transistors; billion-transistor chips are on the horizon

Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction

Reasons for previous failures
• Lack of a compelling reason: design complexity was still manageable a decade ago
• Lack of a solid RTL foundation
• Lack of consideration of physical reality

Page 4: xPilot   A Platform-Based Behavioral Synthesis System

Motivation (2)

Behavioral synthesis provides combined advantages

Better complexity management
• Code size: RTL design ~300KL versus behavioral design 40KL [NEC, ASPDAC04]

Shorter verification/simulation cycle
• Simulation speed 100X faster than the RTL-based method

Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives

Higher quality of results
• Full consideration of physical reality

Page 5: xPilot   A Platform-Based Behavioral Synthesis System

xPilot: Platform-Based Behavioral to RTL Synthesis Flow

[Flow diagram]

Inputs: behavioral spec. in C/SystemC; platform description

Frontend compiler

SSDM

Presynthesis optimizations
• Loop unrolling/shifting
• Strength reduction / tree height reduction
• Bitwidth analysis
• Memory analysis
• …

Core synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding

Arch-generation & RTL/constraints generation
• Verilog/VHDL/SystemC
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …

Output: RTL for FPGAs/ASICs

Page 6: xPilot   A Platform-Based Behavioral Synthesis System

System-level Synthesis Data Model

SSDM (System-level Synthesis Data Model)
• Hierarchical netlist of concurrent processes and communication channels
• Each leaf process contains a sequential program, represented by an extended LLVM IR with hardware-specific semantics: port/IO interfaces, bit-vector manipulations, cycle-level notations

Page 7: xPilot   A Platform-Based Behavioral Synthesis System

Platform Modeling & Characterization

Target platform specification

High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
• Functional units: adders, ALUs, multipliers, comparators, etc.
• Connectors: mux, demux, etc.
• Memories: registers, synchronous memories, etc.

Chip layout description
• On-chip resource distributions
• On-chip interconnect delay/power estimation

Page 8: xPilot   A Platform-Based Behavioral Synthesis System

Scheduling: Goals

A highly versatile scheduling engine

Applicable to a wide range of application domains
• Computation-intensive, data/memory-intensive, control-intensive, etc.
• Mixed behavioral & RTL

Amenable to a rich set of scheduling constraints
• Data dependency constraints
• Resource constraints: IO port constraints, memory port constraints, functional unit constraints, etc.
• Timing constraints: frequency constraints, latency constraints, etc.
• Relative IO timing constraints: cycle-fixed mode, superstate-fixed mode, free-floating mode, etc.

Retargetable to a variety of design objectives
• High performance, small area, low power, etc.

Page 9: xPilot   A Platform-Based Behavioral Synthesis System

Scheduling: Optimization Capabilities

Offers a variety of optimization techniques in a unified framework
• Combinational/sequential non-pipelined/pipelined multi-cycle operations
• Unconditional/conditional operation chaining
• Relative scheduling
• Consideration of branching probabilities and repetitions
• Multi-cycle communication (under development)
• Code motion & speculation (under development)
• Functional / loop pipelining (under development)
• Physical layout integration (to be supported)

Page 10: xPilot   A Platform-Based Behavioral Synthesis System

Scheduling: Current Status

Design objective
• Focus on high-performance designs

Overall approach
• Use a system of pairwise difference constraints to express all kinds of scheduling constraints
• Represent the design objective as a linear function
• The system can be solved directly by any linear programming solver, and the solutions are integral

Page 11: xPilot   A Platform-Based Behavioral Synthesis System

Scheduling: Design Framework

[Block diagram of the xPilot scheduler]

Inputs: CDFG; user-specified design constraints & assignments; target platform modeling (resource library & chip layout)

xPilot scheduler
• Constraint equations generation
• Objective function generation
• System of pairwise difference constraints: relative timing constraints, dependency constraints, frequency constraints, resource constraints, …
• Linear programming solver
• LP solution interpretation

Output: STG (State Transition Graph)

Page 12: xPilot   A Platform-Based Behavioral Synthesis System

Example: Greatest Common Divisor

GCD C description

    x = inport1;
    y = inport2;
    while (x != y) {
        if (x > y)
            x = x - y;
        else
            y = y - x;
    }
    *outport = x;

[Figure: CDFG of the GCD in SSA form, basic blocks BB1 to BB5, branch edges labeled T]

    BB1: x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0);
    BB2: x_1 = φ(x_0, x_1, x_2); y_1 = φ(y_0, y_1, y_2); cond2 = (x_1 > y_1);
    BB3: x_2 = x_1 - y_1; cond3 = (x_2 != y_1);
    BB4: y_2 = y_1 - x_1; cond4 = (x_1 != y_2);
    BB5: x_3 = φ(x_0, x_1, x_2); *outport = x_3;

Page 13: xPilot   A Platform-Based Behavioral Synthesis System

Constraints Generation

Data dependency constraint
• Operation v is data dependent on operation u, i.e., $(u, v) \in E$: $s(v) - s(u) \ge 0$, where the schedule variable $s(v)$ represents the relative schedule of node v
• Example: u: x_1 = φ(x_0, x_1, x_2); v: cond2 = (x_1 > y_1);

Other constraints can be represented in a similar way …

The constraint equations form a system of pairwise difference constraints
• The constraint matrix A is totally unimodular
• The feasibility check can be formulated as a single-source shortest path problem
• Optimization can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem
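The shortest-path feasibility check mentioned on this slide can be sketched as follows. This is a minimal illustration and not xPilot's implementation: the function name `solve_difference_constraints`, the `(u, v, c)` tuple encoding, and the three-operation example are assumptions made here. Each constraint s(v) - s(u) >= c becomes a graph edge from v to u with weight -c, and Bellman-Ford relaxation from an implicit virtual source either yields an integral schedule or finds a negative cycle, which means the constraints are infeasible.

```python
def solve_difference_constraints(nodes, constraints):
    """Solve a system of constraints s(v) - s(u) >= c.

    constraints: list of (u, v, c) tuples (hypothetical encoding).
    Returns a dict node -> integer step, or None if infeasible.
    """
    # s(v) - s(u) >= c  is the same as  s(u) - s(v) <= -c,
    # which maps to an edge v -> u with weight -c.
    edges = [(v, u, -c) for (u, v, c) in constraints]

    # A virtual source with a zero-weight edge to every node is
    # modelled by starting all distances at 0.
    dist = {n: 0 for n in nodes}

    # Bellman-Ford relaxation.
    for _ in range(max(len(nodes) - 1, 0)):
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w

    # One more pass: any edge that can still be relaxed lies on or
    # behind a negative cycle, so the system has no solution.
    for a, b, w in edges:
        if dist[a] + w < dist[b]:
            return None

    # Shift so that the earliest operation sits at step 0.
    offset = min(dist.values(), default=0)
    return {n: d - offset for n, d in dist.items()}


# b may chain with a in the same step; c must start at least one step after b.
feasible = solve_difference_constraints(
    ["a", "b", "c"], [("a", "b", 0), ("b", "c", 1)])
print(feasible)    # {'a': 0, 'b': 0, 'c': 1}

# Additionally requiring s(a) >= s(c) contradicts the chain above.
infeasible = solve_difference_constraints(
    ["a", "b", "c"], [("a", "b", 0), ("b", "c", 1), ("c", "a", 0)])
print(infeasible)  # None
```

Because all constants are integers, every distance stays an integer, which matches the slide's point that the constraint system admits integral solutions. Optimizing an objective over the feasible schedules would still need an LP or min-cost flow solver, which this sketch leaves out.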

Page 14: xPilot   A Platform-Based Behavioral Synthesis System

Solution by LP Solver

[Figure: the GCD CDFG (BB1 to BB5) annotated with the control steps, 0 and 1, assigned by the LP solver]

Scheduling is performed across basic block boundaries

Page 15: xPilot   A Platform-Based Behavioral Synthesis System

Schedule Interpretation

Step 0 (unchanged by interpretation):

    x_0 = inport1;
    y_0 = inport2;
    cond1 = (x_0 != y_0);

Step 1 as scheduled (flat list of operations):

    x_1 = φ(x_0, x_1, x_2);
    y_1 = φ(y_0, y_1, y_2);
    cond2 = (x_1 > y_1);
    x_2 = x_1 - y_1;
    cond3 = (x_2 != y_1);
    y_2 = y_1 - x_1;
    cond4 = (x_1 != y_2);
    x_3 = φ(x_0, x_1, x_2);
    *outport = x_3;

Step 1 after interpretation (control conditions restored):

    if (cond1) {
        x_1 = φ(x_0, x_1, x_2);
        y_1 = φ(y_0, y_1, y_2);
        cond2 = (x_1 > y_1);
        if (cond2) {
            x_2 = x_1 - y_1;
            cond3 = (x_2 != y_1);
        } else {
            y_2 = y_1 - x_1;
            cond4 = (x_1 != y_2);
        }
    }
    if (!cond1 || (!cond3 && !cond4)) {
        x_3 = φ(x_0, x_1, x_2);
        *outport = x_3;
    }

Page 16: xPilot   A Platform-Based Behavioral Synthesis System

Deriving State Transition Graph

Final STG for GCD

First state:

    x_0 = inport1;
    y_0 = inport2;
    cond1 = (x_0 != y_0);

Second state:

    if (cond1) {
        x_1 = φ(x_0, x_1, x_2);
        y_1 = φ(y_0, y_1, y_2);
        cond2 = (x_1 > y_1);
        if (cond2) {
            x_2 = x_1 - y_1;
            cond3 = (x_2 != y_1);
        } else {
            y_2 = y_1 - x_1;
            cond4 = (x_1 != y_2);
        }
    }
    if (!cond1 || (!cond3 && !cond4)) {
        x_3 = φ(x_0, x_1, x_2);
        *outport = x_3;
    }

Transition condition shown in the figure: cond3 || cond4
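To make the two-state STG concrete, the sketch below simulates it cycle by cycle. It rests on assumptions that the transcript does not state outright: that `cond3 || cond4` labels a self-loop on the second state, that the φ nodes are realized as two registers `x` and `y`, and that the inputs are positive integers. The function name `gcd_two_state` is hypothetical.

```python
import math


def gcd_two_state(in1, in2):
    """Simulate the two-state GCD machine; returns (result, cycles)."""
    # First state: read the inputs and compute cond1.
    x, y = in1, in2
    cond1 = x != y
    cycles = 1

    # Second state: repeated while cond3 || cond4 holds (assumed self-loop).
    while True:
        cycles += 1
        loop_again = False            # stands for cond3 || cond4
        if cond1:
            if x > y:                 # cond2
                x = x - y
                loop_again = x != y   # cond3
            else:
                y = y - x
                loop_again = x != y   # cond4
        if not loop_again:            # !cond1 || (!cond3 && !cond4)
            return x, cycles          # *outport = x_3


print(gcd_two_state(12, 18))   # (6, 3)
print(gcd_two_state(5, 5))     # (5, 2)
print(gcd_two_state(35, 14)[0] == math.gcd(35, 14))   # True
```

Under these assumptions each pass through the second state performs one subtraction, so the cycle count is one plus the number of loop iterations, with a minimum of two cycles when the inputs are already equal.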

Page 17: xPilot   A Platform-Based Behavioral Synthesis System

Unified Resource Binding

Provides a unified resource sharing framework to optimize for various design objectives
• Simultaneous functional unit binding, register binding and port binding
• Equipped with advanced techniques to optimize the interconnect and steering logic networks
• Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
• Extendable to exploit physical layout information

Page 18: xPilot   A Platform-Based Behavioral Synthesis System

An FU/Register Binding Example

[Figure: two binding solutions, (a) and (b), each shown for Case 1 and Case 2; the datapaths are built from registers R1 to R5, functional units F1 and F2, and multiplexers (MUX)]

Observations:
• Binding has a large impact on the resulting performance and cost
• Functional unit and register binding are highly correlated

Note: Assume all operations and variables are compatible for sharing

Page 19: xPilot   A Platform-Based Behavioral Synthesis System

Drawbacks of Previous Work

Many existing algorithms focus on minimizing the "number" of functional units or registers
• Technology advances: the interconnect effect is increasing. Interconnect accounts for 51% of the total dynamic power of a microprocessor in 0.13um tech., and up to 80% of the dynamic power in future technologies
• May generate a larger number of multiplexers and interconnects
• Unfavorable performance and cost results

Optimization for unrealistic goals
• Minimizing the "number" of FUs, registers, or multiplexers: should have detailed datapath models to guide the optimization
• No technology-specific consideration: should have platform-specific characterizations

Page 20: xPilot   A Platform-Based Behavioral Synthesis System

Resource Binding in xPilot

[Flow diagram: xPilot architecture exploration]

Inputs: STG (State Transition Graph); user-specified design constraints; target platform (resource library & chip layout)

Steps
• Baseline register binding
• Iteration: FU allocation/binding and register allocation/binding, with an "Improved?" check (Yes/No)
• Datapath model for performance-cost estimation

Output: STG + best datapath models

Page 21: xPilot   A Platform-Based Behavioral Synthesis System

Design Space Exploration

[Figure: a State Transition Graph (STG) with operations 1* to 5* and 6+; compatible graphs; the datapath for solution (1, 2, 4) (3), with two MUL units; and a datapath model curve (power vs. delay, labels C1, C1', C2, C2') for design space pruning, with pruned points marked]

Exploration phases:

Exploring Node 2:
• (1) (2): two mul
• (1, 2): one mul

Exploring Node 3:
• (1) (2) (3): three mul
• (1, 2) (3): two mul
• (1, 3) (2): two mul

Exploring Node 4:
• (1) (2) (3) (4)
• (1, 2, 4) (3)
• (1, 2) (3, 4)
• (1, 2) (3) (4)
• (1, 3, 4) (2)
• (1, 3) (2, 4)
• (1, 3) (2) (4)
• …
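The node-by-node exploration listed on this slide can be sketched as an incremental enumeration: each newly explored operation either joins an existing functional-unit group whose members are all compatible with it, or opens a new unit. The code below is only an illustration. The rule that operations 2 and 3 cannot share a multiplier is an assumption inferred from the absence of (2, 3) groupings in the slide's lists, the names `compatible` and `extend_bindings` are hypothetical, and the pruning by the datapath model curve is omitted.

```python
def compatible(a, b):
    # Assumption for illustration: operations 2 and 3 cannot share a unit;
    # every other pair of multiplications may be bound to the same multiplier.
    return {a, b} != {2, 3}


def extend_bindings(bindings, op):
    """Return every binding obtained by adding operation `op`.

    A binding is a tuple of groups; each group is a tuple of the
    operations that share one functional unit.
    """
    extended = []
    for groups in bindings:
        # Option 1: share an existing unit with a fully compatible group.
        for i, group in enumerate(groups):
            if all(compatible(op, other) for other in group):
                extended.append(groups[:i] + (group + (op,),) + groups[i + 1:])
        # Option 2: allocate a new unit for this operation.
        extended.append(groups + ((op,),))
    return extended


bindings = [((1,),)]          # node 1 alone on one multiplier
counts = []
for op in (2, 3, 4):
    bindings = extend_bindings(bindings, op)
    counts.append(len(bindings))

print(counts)                 # [2, 3, 10]
for groups in bindings:
    print(groups, "->", len(groups), "mul")
```

With this assumption the sketch yields two candidates after node 2 and three after node 3, matching the slide. The unpruned count after node 4 is ten, which is consistent with the slide listing seven candidates followed by an ellipsis.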

Page 22: xPilot   A Platform-Based Behavioral Synthesis System

Experimental Results: Benchmark Suite

Benchmark suite

PR, MCM:
• DSP kernels: pure additions/subtractions and multiplications

CACHE:
• Cache controller: control-intensive design with cycle-accurate I/O operations

MOTION:
• Motion compensation algorithm for MPEG-1 decoder: control-intensive with a modest amount of computation

IDCT:
• JPEG inverse discrete cosine transform: computation-intensive

DWT:
• JPEG2000 discrete wavelet transform: computation-intensive with modest control flow

EDGELOOP:
• Extracted from an H.264 decoder: a very complex design that features a mix of computation, control, and memory accesses

Page 23: xPilot   A Platform-Based Behavioral Synthesis System

Experimental Results: Code Size Reduction

Page 24: xPilot   A Platform-Based Behavioral Synthesis System

Experimental Results: Comparison with SPARK on Scheduling

Columns state# and reg# come from the synthesis report; fmax, LE, register, mem and dsp come from the Altera Quartus II report.

Designs  Tool/Flow    state#  reg#  fmax (MHz)  LE      register  mem    dsp
MOTION   spark        13      18    170.8       666     367       0      4
MOTION   xpilot       24      11    161.2       888     266       0      4
PR       spark        13      36    130.6       508     491       0      32
PR       xpilot       13      40    178.7       1,349   783       0      0
IDCT     spark        176     ~400  72.01       10,847  4,547     0      138
IDCT     xpilot       141     413   105.53      11,481  5,627     0      64
IDCT     xpilot-mem   334     451   162.9       9,351   6,098     1,024  64
CACHE    spark        Memory unsupported
CACHE    xpilot-mem   47      16    161.6       371     265       3,072  0

SPARK [UCI/UCSD, 2004] is a state-of-the-art academic high-level synthesis tool.

Page 25: xPilot   A Platform-Based Behavioral Synthesis System

Experimental Results: Comparison with SPARK on Binding

On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK

SPARK resource usage and Fmax

Designs    LE     COMB   Lonely-Reg  Comb-Reg  DSP  Fmax (MHz)
PR         1108   815    0           293       0    123.53
WANG       1217   942    0           275       0    118.89
LEE        1367   1052   0           315       0    119.32
MCM        2808   2248   0           560       0    74.87
DIR        2425   2034   0           391       6    69.38
FEIG       16170  13136  0           3034      6    37.17
Total      25095  20227  0           4868      12   543.16
Ave Ratio  1      1      1           1         1    1

xPilot resource usage and Fmax

Designs    LE     COMB   Lonely-Reg  Comb-Reg  DSP   Fmax (MHz)  Fmax Ratio xPilot/SPARK
PR         1349   713    84          552       0     178.7       1.45
WANG       1105   527    62          516       8     166.11      1.40
LEE        1585   691    207         687       4     166.61      1.40
MCM        2402   981    73          1348      0     152.56      2.04
DIR        3489   1752   110         1627      4     146.8       2.12
FEIG       10539  2295   240         8004      4     173.49      4.67
Total      20469  6959   776         12734     20    984.27      1.81
Ave Ratio  1.17   0.65   n/a*        2.96      n/a*  2.48        2.48

Page 26: xPilot   A Platform-Based Behavioral Synthesis System

Synthesis Results for DWT (JPEG2000)

Settings
• Target platform: Altera Stratix
• RTL synthesis & place-and-route: Altera Quartus II v5.0
• Simulation: Mentor ModelSim SE 6.0

Design alternatives

Target cycle time  State#  fmax (MHz)  Cycle#  Latency (µs)  LE#   DSP#
9ns                34      123.56      4830    39.1          1777  128
7ns                36      147.28      5211    35.4          1862  128
5.5ns              51      183.62      6926    37.8          1926  128
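The latency column of the DWT results can be cross-checked from the cycle count and fmax columns. The sketch below assumes that latency is simply the cycle count divided by the achieved clock frequency; the transcript does not state the formula, and the name `latency_us` is hypothetical. Dividing cycles by a frequency in MHz gives microseconds.

```python
# (target cycle time, Cycle#, fmax in MHz, reported latency) from the DWT table
dwt_rows = [
    ("9ns", 4830, 123.56, 39.1),
    ("7ns", 5211, 147.28, 35.4),
    ("5.5ns", 6926, 183.62, 37.8),
]


def latency_us(cycles, fmax_mhz):
    """Latency in microseconds: cycles / (fmax in MHz)."""
    return cycles / fmax_mhz


for target, cycles, fmax, reported in dwt_rows:
    computed = latency_us(cycles, fmax)
    print(f"{target}: computed {computed:.2f} us, reported {reported}")
```

The computed values agree with the reported ones to within about 0.1. They also show the tradeoff behind the table: a tighter target cycle time raises fmax but also raises the cycle count, so the 7ns alternative ends up with the lowest latency of the three.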

Page 27: xPilot   A Platform-Based Behavioral Synthesis System

Experimental Results: ASIC Flow

Magma RTL to GDSII flow

Technology library: Cadence Generic Standard Cell Library 0.18um

Tradeoff study:
• 1st column: delay constraint enforced in xPilot
• 2nd column: control step count of xPilot-generated RTL
• 3rd-5th column: data reported after mapping by the Magma tool

DIR   State#  Cell count  Area (um^2)  Delay (ps)  Fmax (MHz)  Latency (ps)
5ns   5       17555       1256584      2111        473.71      10555
10ns  3       23077       1332203      2139        467.51      6417
15ns  2       28381       1486487      2181        458.51      4362
20ns  2       27189       1394451      2514        397.77      5028
30ns  1       27797       1401642      2725        366.97      2725
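The derived columns of the DIR table follow from the state count and the post-mapping delay. The sketch below checks two relations that the numbers are consistent with, although the transcript does not state them: Fmax in MHz equals 10^6 divided by the delay in picoseconds, and latency equals state count times delay. The helper names are hypothetical.

```python
# (delay constraint, State#, Delay in ps, Fmax in MHz, Latency in ps) for DIR
dir_rows = [
    ("5ns", 5, 2111, 473.71, 10555),
    ("10ns", 3, 2139, 467.51, 6417),
    ("15ns", 2, 2181, 458.51, 4362),
    ("20ns", 2, 2514, 397.77, 5028),
    ("30ns", 1, 2725, 366.97, 2725),
]


def fmax_mhz(delay_ps):
    """Clock frequency in MHz for a critical-path delay given in ps."""
    return 1e6 / delay_ps


def latency_ps(states, delay_ps):
    """Total latency when every control step takes one clock period."""
    return states * delay_ps


for constraint, states, delay, fmax, latency in dir_rows:
    print(constraint,
          f"fmax {fmax_mhz(delay):.2f} MHz (reported {fmax}),",
          f"latency {latency_ps(states, delay)} ps (reported {latency})")
```

Read this way, the table shows that relaxing the delay constraint lets xPilot pack the design into fewer control steps, and the single-state 30ns alternative has the lowest latency even though its clock is the slowest.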

Page 28: xPilot   A Platform-Based Behavioral Synthesis System

Experimental Results: ASIC Flow (cont.)

LEE   State#  Cell count  Area (um^2)  Delay (ps)  Fmax (MHz)  Latency (ps)
5ns   8       8242        509807       2066        484.03      16528
10ns  4       15989       708870       2254        443.66      9016
15ns  2       16698       703381       3423        292.14      6846
20ns  2       15256       656147       4226        236.63      8452
30ns  1       16085       697363       5070        197.24      5070

Motion  State#  Cell count  Area (um^2)  Delay (ps)  Fmax (MHz)  Latency (ps)
10ns    35      16474       909721       2107        474.61      73745
15ns    30      15695       847262       2358        424.09      70740
20ns    28      16463       867898       2498        400.32      69944
30ns    28      15807       852573       2563        390.17      71764