Center for Embedded Computer Systems University of California, Irvine and San Diego spark SPARK: A...

26
Center for Embedded Computer Systems University of California, Irvine and San Diego http://www.cecs.uci.edu/~spark SPARK: A Parallelizing High-Level Synthesis Framework Supported by Semiconductor Research Corporation & Intel Inc Sumit Gupta Rajesh Gupta, Nikil Dutt, Alex Nicolau
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    214
  • download

    0

Transcript of Center for Embedded Computer Systems University of California, Irvine and San Diego spark SPARK: A...

Center for Embedded Computer SystemsUniversity of California, Irvine and San Diego

http://www.cecs.uci.edu/~spark

SPARK: A Parallelizing High-Level Synthesis Framework

Supported by Semiconductor Research Corporation & Intel Inc

Sumit GuptaRajesh Gupta, Nikil Dutt, Alex Nicolau

2Copyright Sumit Gupta 2003

System Level SynthesisSystem Level Synthesis

System LevelModel

TaskAnalysis

HW/SWPartitioning

ASIC

ProcessorCore

Memory

FPGA

I/O

HardwareBehavioralDescription

SoftwareBehavioralDescription

SoftwareCompiler

HighLevel

Synthesis

3Copyright Sumit Gupta 2003

High Level SynthesisHigh Level Synthesis

M e m o r y

ALUCo

ntr

ol

Data path

d = e - f g = h + i

If NodeT F

c

x = a + bc = a < b

j = d x gl = e + x

x = a + b;c = a < b;if (c) then d = e – f;else g = h + i;j = d x g;l = e + x;

Transform behavioral descriptions to RTL/gate level

From C to CDFG to Architecture

Problem # 1Problem # 1 : :Poor quality of HLS results beyond Poor quality of HLS results beyond straight-line behavioral descriptionsstraight-line behavioral descriptions

Poor/No controllability of Poor/No controllability of the HLSthe HLSresultsresults

Problem # 2Problem # 2 : :

4Copyright Sumit Gupta 2003

High-level SynthesisHigh-level Synthesis Well-researched area: from early 1980’sWell-researched area: from early 1980’s

Renewed interest due to new system level design Renewed interest due to new system level design methodologies methodologies

Large number of synthesis optimizations have been Large number of synthesis optimizations have been proposed proposed Either Either operation leveloperation level: algebraic transformations on DSP : algebraic transformations on DSP

codescodes or or logic levellogic level: Don’t Care based control optimizations: Don’t Care based control optimizations In contrast, compiler transformations operate at both In contrast, compiler transformations operate at both

operation level (fine-grain) and source level (coarse-grain) operation level (fine-grain) and source level (coarse-grain) Parallelizing Compiler TransformationsParallelizing Compiler Transformations

Different optimization objectives and cost models than HLSDifferent optimization objectives and cost models than HLS Our aimOur aim: Develop Synthesis and : Develop Synthesis and ParallelizingParallelizing Compiler Compiler

Transformations that are “useful” for HLS Transformations that are “useful” for HLS Beyond scheduling results: in Beyond scheduling results: in Circuit Area and DelayCircuit Area and Delay For large designs with For large designs with complex control flowcomplex control flow (nested (nested

conditionals/loops)conditionals/loops)

5Copyright Sumit Gupta 2003

Our Approach: Our Approach: ParallelizingParallelizing HLS HLS (PHLS)(PHLS)

C Input VHDLOutput

Original CDFG

Optimized CDFG

Scheduling& Binding

Source-Level Compiler

Transformations

Scheduling Compiler & Dynamic

Transformations

Optimizing Compiler and Parallelizing Compiler transformations Optimizing Compiler and Parallelizing Compiler transformations applied at applied at Source-levelSource-level (Pre-synthesis) and during (Pre-synthesis) and during SchedulingScheduling Source-level code refinement using Source-level code refinement using Pre-synthesisPre-synthesis

transformationstransformations Code Restructuring by Code Restructuring by SpeculativeSpeculative Code Motions Code Motions Operation Operation replicationreplication to improve concurrency to improve concurrency DynamicDynamic transformations: exploit new opportunities during transformations: exploit new opportunities during

schedulingscheduling

6Copyright Sumit Gupta 2003

SPARSPARKK

High High Level Level

SynthesiSynthesis s

FramewoFrameworkrk

7Copyright Sumit Gupta 2003

SPARK Parallelizing HLS SPARK Parallelizing HLS FrameworkFramework C input and C input and SynthesizableSynthesizable RTL VHDL output RTL VHDL output

Tool-box Tool-box of Transformations and Heuristicsof Transformations and Heuristics Each of these can be developed independently of the otherEach of these can be developed independently of the other

Script-basedScript-based application of transformations, passes, and application of transformations, passes, and heuristics: similar to Synopsys Design Compilerheuristics: similar to Synopsys Design Compiler

Hierarchical Intermediate Representation (HTGs)Hierarchical Intermediate Representation (HTGs) Retains structural information about design (conditional blocks, Retains structural information about design (conditional blocks,

loops)loops) Enables efficient and structured application of transformationsEnables efficient and structured application of transformations

Complete HLS tool:Complete HLS tool: Does Resource Binding & Control Does Resource Binding & Control Synthesis Synthesis

Enables Enables Graphical VisualizationGraphical Visualization of Design description and of Design description and intermediate results (CDFG, DFG, HTG)intermediate results (CDFG, DFG, HTG)

BenchmarkedBenchmarked on large set of multimedia & image on large set of multimedia & image processing designsprocessing designs

SPARK System Release SPARK System Release available for downloadavailable for download User ManualUser Manual for running tool and changing synthesis scripts for running tool and changing synthesis scripts TutorialTutorial for the synthesis of a portion of a MPEG player for the synthesis of a portion of a MPEG player

100,000+ lines of C++ code100,000+ lines of C++ code

8Copyright Sumit Gupta 2003

PHLS TransformationsPHLS TransformationsOrganized into Four GroupsOrganized into Four Groups

Pre-SynthesisPre-Synthesis Source-to-Source Transformations Source-to-Source Transformations Loop-Invariant Code Motions, Loop Unrolling, CSELoop-Invariant Code Motions, Loop Unrolling, CSE

SchedulingScheduling synthesis & compiler transformations synthesis & compiler transformations Speculative Code Motions, Multi-cycling, Operation Speculative Code Motions, Multi-cycling, Operation

Chaining, Loop Shifting (Incremental Loop Pipelining Chaining, Loop Shifting (Incremental Loop Pipelining technique)technique)

DynamicDynamic: Transformations applied dynamically : Transformations applied dynamically during schedulingduring scheduling Dynamic CSE & Copy Propagation, Dynamic Branch Dynamic CSE & Copy Propagation, Dynamic Branch

BalancingBalancing Basic CompilerBasic Compiler Transformations Transformations

Copy Propagation, Dead Code Elimination, constant Copy Propagation, Dead Code Elimination, constant propagationpropagationApplication of these transformations is Application of these transformations is

guided by Synthesis Scriptsguided by Synthesis Scripts

9Copyright Sumit Gupta 2003

ExperimentsExperiments We used We used SPARKSPARK to synthesize designs derived from to synthesize designs derived from

several industrial designsseveral industrial designs Example: MPEG-1, MPEG-2, GIMP Image Processing Example: MPEG-1, MPEG-2, GIMP Image Processing

softwaresoftware Case StudyCase Study of Intel Instruction Length Decoder of Intel Instruction Length Decoder

Quantified effects of individual transformations on Quantified effects of individual transformations on QORQOR Pre-synthesis transformationsPre-synthesis transformations Speculative Code Motions, Loop PipelilingSpeculative Code Motions, Loop Pipeliling Dynamic TransformationsDynamic Transformations

Scheduling ResultsScheduling Results Number of States in Number of States in

FSMFSM Cycles on Longest Path Cycles on Longest Path

through Designthrough Design

VHDL: Logic VHDL: Logic Synthesis Synthesis Critical Path Length Critical Path Length

(ns)(ns) Unit AreaUnit Area

10Copyright Sumit Gupta 2003

MPEG-1 Pred1 Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

+ Speculative Code Motions

+ Pre-Synthesis Transforms

+ Dynamic CSE

MPEG-1 Pred2 Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

Scheduling & Logic Synthesis Scheduling & Logic Synthesis ResultsResults

Non-speculative CMs: Within BBs & Across Hier Blocks

42%

10%

36%

36%

8%

39%

Overall: 63-66 % improvement in DelayOverall: 63-66 % improvement in Delay

Almost constant Area Almost constant Area

11Copyright Sumit Gupta 2003

Example Design: ILD Block Example Design: ILD Block from Intelfrom Intel

Case Study: A design derived from the Case Study: A design derived from the Instruction Instruction Length DecoderLength Decoder of the Intel Pentium of the Intel Pentium®® class of class of processorsprocessors

CharacteristicsCharacteristics of Microprocessor functional blocks of Microprocessor functional blocks Low Latency: Single or Dual cycle implementationLow Latency: Single or Dual cycle implementation Consist of several small computationsConsist of several small computations Intermix of control and data logicIntermix of control and data logic

Starting with a sequential, multi-cycle specification, Starting with a sequential, multi-cycle specification, we achieved a fully parallel, single-cycle designwe achieved a fully parallel, single-cycle design Our toolbox approach enables us to develop a script to Our toolbox approach enables us to develop a script to

synthesize applications from different domainssynthesize applications from different domains Final design looks close to the actual implementation Final design looks close to the actual implementation

done by Inteldone by Intel

12Copyright Sumit Gupta 2003

Key Insights from ProjectKey Insights from Project Coarse-grain and Fine-grain Parallelizing Coarse-grain and Fine-grain Parallelizing

transformations and basic compiler transformations transformations and basic compiler transformations are essential and key to achieving high quality of are essential and key to achieving high quality of synthesis resultssynthesis results

Language-level pre-synthesis optimizations are Language-level pre-synthesis optimizations are important due to the high-level of abstraction at the important due to the high-level of abstraction at the level of behavioral C level of behavioral C Also important for coarse-grain design space explorationAlso important for coarse-grain design space exploration

Although a range of (compiler & synthesis) Although a range of (compiler & synthesis) optimizations exist, they have to be carefully guided optimizations exist, they have to be carefully guided by heuristics and scripts to achieve desired resultsby heuristics and scripts to achieve desired results

Transformations from compilers and parallelizing Transformations from compilers and parallelizing compilers do not directly translate over to synthesiscompilers do not directly translate over to synthesis Need to be radically changed with completely different Need to be radically changed with completely different

cost models and guiding principlescost models and guiding principles New parallelizing transformations (or transformations that New parallelizing transformations (or transformations that

are not useful for compilers) have to be developed for are not useful for compilers) have to be developed for synthesissynthesis

13Copyright Sumit Gupta 2003

Key Insights from ProjectKey Insights from Project Designers want script based control over transformations, Designers want script based control over transformations,

passes – similar to Synopsys Design Compilerpasses – similar to Synopsys Design Compiler Designer Insights can be used to guide transformations – Designer Insights can be used to guide transformations –

especially coarse-grain code restructuring for design space especially coarse-grain code restructuring for design space explorationexploration

Optimizations that improve schedule length (cycles) do Optimizations that improve schedule length (cycles) do not necessarily improve circuit delay (due to longer not necessarily improve circuit delay (due to longer critical paths, i.e., clock period)critical paths, i.e., clock period) For example, loop unrolling and loop pipelining: they increase the For example, loop unrolling and loop pipelining: they increase the

number of operations in the design and hence, resource utilization number of operations in the design and hence, resource utilization and in turn, size of multiplexers and controllers increaseand in turn, size of multiplexers and controllers increase

Traditional CDFG and DFG representations used in high-Traditional CDFG and DFG representations used in high-level synthesis are not sufficient for designs with complex level synthesis are not sufficient for designs with complex control flowcontrol flow A Hierarchical intermediate representation (Hierarchical Task A Hierarchical intermediate representation (Hierarchical Task

Graphs – HTGs) is required for retaining control and structural Graphs – HTGs) is required for retaining control and structural information for efficient coarse-level optimizationsinformation for efficient coarse-level optimizations

Full set of data dependencies (RAW, WAR, WAW) are required for Full set of data dependencies (RAW, WAR, WAW) are required for correlating output VHDL and C with input C.correlating output VHDL and C with input C.

14Copyright Sumit Gupta 2003

ConclusionsConclusions Parallelizing code transformations enable a new Parallelizing code transformations enable a new

range of HLS transformationsrange of HLS transformations Provide the needed improvement in quality of HLS results Provide the needed improvement in quality of HLS results

Possible to be competitive against manually designed Possible to be competitive against manually designed circuits. circuits.

Can enable productivity improvements in microelectronic Can enable productivity improvements in microelectronic designdesign

Built a C-to-VHDL synthesis system with a range of Built a C-to-VHDL synthesis system with a range of code transformationscode transformations Platform for applying Coarse and Fine-grain OptimizationsPlatform for applying Coarse and Fine-grain Optimizations Tool-box approach where transformations and heuristics Tool-box approach where transformations and heuristics

can be developedcan be developed Enables the designer to find the right Enables the designer to find the right synthesis scriptsynthesis script

for different application domainsfor different application domains Performance improvements of 60-70 % across a number Performance improvements of 60-70 % across a number

of designsof designs We have shown its effectiveness on an Intel designWe have shown its effectiveness on an Intel design

15Copyright Sumit Gupta 2003

SPARK ReleaseSPARK Release Available for downloadAvailable for download

http://www.cecs.uci.edu/~spark User ManualUser Manual

Running the toolRunning the tool Customizing the synthesis scriptsCustomizing the synthesis scripts

TutorialTutorial Synthesis of Portion of the Motion Synthesis of Portion of the Motion

Compensation algorithm in MPEG-1 Compensation algorithm in MPEG-1 playerplayer

Thank YouThank You

17Copyright Sumit Gupta 2003

Ongoing WorkOngoing Work: Interface Synthesis : Interface Synthesis Co-Design Targeting a FPGA Co-Design Targeting a FPGA

PlatformPlatform Developed novel memory mapping Developed novel memory mapping

algorithm to fit memory elements/ algorithm to fit memory elements/ application onto FPGA platformapplication onto FPGA platform

C InputMPEG-1

Pred Block

ExecutionProfiling

Manual HW/SW

PartitioningProcessor

Core

MemoryI/O

Hardware CDescription

SoftwareC

Description

SoftwareCompiler

SPARK High-LevelSynthesis

FPGA

FPGA Platform

InterfaceSynthesis

18Copyright Sumit Gupta 2003

Future Plans or What is Future Plans or What is MissingMissing

Need for ability to specify timing of Need for ability to specify timing of signalssignals

Interface with logic synthesis tools to Interface with logic synthesis tools to enable better module selection, operator enable better module selection, operator chaining/mergingchaining/merging

Time-constrained synthesisTime-constrained synthesis Power Analysis of parallelizing Power Analysis of parallelizing

optimizationsoptimizations More transformations such as loop More transformations such as loop

fusion, range analysis requiredfusion, range analysis required

19Copyright Sumit Gupta 2003

Synthesizable CSynthesizable C ANSI-C front end from Edison Design Group ANSI-C front end from Edison Design Group

(EDG)(EDG) Features of C not supported for synthesisFeatures of C not supported for synthesis

PointersPointers However, Arrays and passing by reference However, Arrays and passing by reference areare

supportedsupported Recursive Function CallsRecursive Function Calls GotosGotos

Features for which support has not been Features for which support has not been implementedimplemented Multi-dimensional arraysMulti-dimensional arrays StructsStructs Continue, BreaksContinue, Breaks

Hardware component generated for each function Hardware component generated for each function A called function is instantiated as a hardware A called function is instantiated as a hardware

component in calling functioncomponent in calling function

20Copyright Sumit Gupta 2003

HTGHTG DFGDFGGraph Graph VisualizationVisualization

21Copyright Sumit Gupta 2003

Resource Utilization GraphResource Utilization Graph

SchedulingScheduling

22Copyright Sumit Gupta 2003

Example of Example of ComplexComplex HTG HTG Example of a real Example of a real

design: MPEG-1 pred2 design: MPEG-1 pred2 functionfunction Just for demonstration; Just for demonstration;

you are not expected to you are not expected to read the textread the text

Multiple nested loops Multiple nested loops and conditionalsand conditionals

23Copyright Sumit Gupta 2003

Target ApplicationsTarget ApplicationsDesignDesign # of # of

IfsIfs# of # of

LoopsLoops# Non-# Non-Empty Empty Basic Basic BlocksBlocks

# of # of OperatiOperati

onsons

MPEG-1 MPEG-1 pred1pred1

44 22 1717 123123

MPEG-1 MPEG-1 pred2pred2

1111 66 4545 287287

MPEG-2 MPEG-2 dp_framdp_fram

ee

1818 44 6161 260260

GIMP GIMP

tilertiler1111 22 3535 150150

24Copyright Sumit Gupta 2003

Non-speculative CMs: Within BBs & Across Hier Blocks

+ Speculative Code Motions

+ Pre-Synthesis Transforms

+ Dynamic CSE

Scheduling & Logic Synthesis Scheduling & Logic Synthesis ResultsResultsMPEG-2 DpFrame Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

GIMP Tiler Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

14%

20%1%

33%

41%

52%

Overall: 48-76 % improvement in DelayOverall: 48-76 % improvement in Delay

Almost constant Area Almost constant Area

25Copyright Sumit Gupta 2003

Case Study: Case Study: IntelIntel Instruction Length Instruction Length DecoderDecoder

Stream ofInstructions

Instruction Length Decoder

FirstInsn

SecondInsn

ThirdInstruction

Instruction BufferInstruction Buffer

26Copyright Sumit Gupta 2003

ILD Synthesis: Resulting ILD Synthesis: Resulting ArchitectureArchitecture

Speculate Operations,Fully Unroll Loop,

Eliminate Loop Index Variable

Multi-cycle Sequential

Architecture

Multi-cycle Sequential

Architecture

Single cycle Parallel

Architecture

Single cycle Parallel

Architecture

Our toolbox approach enables us to develop a Our toolbox approach enables us to develop a script to synthesize applications from different script to synthesize applications from different domainsdomains

Final design looks close to the actual Final design looks close to the actual implementation done by Intelimplementation done by Intel