Center for Embedded Computer Systems University of California, Irvine and San Diego

23
Center for Embedded Computer Systems University of California, Irvine and San Diego http://www.cecs.uci.edu/~spark of FPGA Blocks using Parallelizing Code Transformations Supported by Semiconductor Research Corporation Sumit Gupta Manev Luthra, Nikil Dutt, Alex Nicolau Rajesh Gupta

description

Hardware and Interface Synthesis of FPGA Blocks using Parallelizing Code Transformations. Sumit Gupta Manev Luthra, Nikil Dutt, Alex Nicolau Rajesh Gupta. Center for Embedded Computer Systems University of California, Irvine and San Diego http://www.cecs.uci.edu/~spark. - PowerPoint PPT Presentation

Transcript of Center for Embedded Computer Systems University of California, Irvine and San Diego

Page 1: Center for Embedded Computer Systems University of California, Irvine and San Diego

Center for Embedded Computer SystemsUniversity of California, Irvine and San Diego

http://www.cecs.uci.edu/~spark

Hardware and Interface Synthesis of FPGA Blocks using

Parallelizing Code Transformations

Supported by Semiconductor Research Corporation

Sumit GuptaManev Luthra, Nikil Dutt, Alex Nicolau

Rajesh Gupta

Page 2: Center for Embedded Computer Systems University of California, Irvine and San Diego

2Copyright Sumit Gupta 2003

Overview of a HW/SW Co-Overview of a HW/SW Co-design Flowdesign Flow

ProcessorCore

MemoryI/O

HardwareBehavioralDescription

SoftwareBehavioralDescription

SoftwareCompiler

HighLevel

Synthesis

Application Specs

HW/SW Partitioning

Interface Synthesis

Logic Synthesis and P&R

C

RTLVHDL

ASIC FPGA

Page 3: Center for Embedded Computer Systems University of California, Irvine and San Diego

3Copyright Sumit Gupta 2003

OutlineOutline Motivation and Background Motivation and Background Our Approach to Our Approach to ParallelizingParallelizing High- High-

Level SynthesisLevel Synthesis Our Approach to Interface Synthesis Our Approach to Interface Synthesis

using Memory Mappingusing Memory Mapping Experimental ResultsExperimental Results

HLS: Multimedia and Image Processing HLS: Multimedia and Image Processing ApplicationsApplications

Interface Synthesis: Case Study of MPEG-1 Interface Synthesis: Case Study of MPEG-1 blockblock

Conclusions and Future WorkConclusions and Future Work

Page 4: Center for Embedded Computer Systems University of California, Irvine and San Diego

4Copyright Sumit Gupta 2003

High Level SynthesisHigh Level Synthesis

M e m o r y

ALUCo

ntr

ol

Data path

d = e - f g = h + i

If NodeT F

c

x = a + bc = a < b

j = d x gl = e + x

x = a + b;c = a < b;if (c) then d = e – f;else g = h + i;j = d x g;l = e + x;

Transform behavioral descriptions to RTL/gate level

From C to CDFG to Architecture

Problem # 1Problem # 1 : :Poor quality of HLS results beyond Poor quality of HLS results beyond straight-line behavioral descriptionsstraight-line behavioral descriptions

Poor/No controllability of Poor/No controllability of the HLSthe HLSresultsresults

Problem # 2Problem # 2 : :

Page 5: Center for Embedded Computer Systems University of California, Irvine and San Diego

5Copyright Sumit Gupta 2003

High-level SynthesisHigh-level Synthesis Well-researched area: from early 1980’sWell-researched area: from early 1980’s

Renewed interest due to new system level design Renewed interest due to new system level design methodologies methodologies

Large number of synthesis optimizations have been Large number of synthesis optimizations have been proposed proposed Either Either operation leveloperation level: algebraic transformations on DSP : algebraic transformations on DSP

codescodes or or logic levellogic level: Don’t Care based control optimizations: Don’t Care based control optimizations In contrast, compiler transformations operate at both In contrast, compiler transformations operate at both

operation level (fine-grain) and source level (coarse-grain) operation level (fine-grain) and source level (coarse-grain) Parallelizing Compiler TransformationsParallelizing Compiler Transformations

Different optimization objectives and cost models than HLSDifferent optimization objectives and cost models than HLS Our aimOur aim: Develop Synthesis and Parallelizing Compiler : Develop Synthesis and Parallelizing Compiler

Transformations that are “useful” for HLS Transformations that are “useful” for HLS Beyond scheduling results: in Beyond scheduling results: in Circuit Area and DelayCircuit Area and Delay For large designs with For large designs with complex control flowcomplex control flow (nested (nested

conditionals/loops)conditionals/loops)

Page 6: Center for Embedded Computer Systems University of California, Irvine and San Diego

6Copyright Sumit Gupta 2003

Our Approach: Our Approach: ParallelizingParallelizing HLS HLS (PHLS)(PHLS)

C Input VHDLOutput

Original CDFG

Optimized CDFG

Scheduling& Binding

Source-Level Compiler

Transformations

Scheduling Compiler & Dynamic

Transformations

Optimizing Compiler and Parallelizing Compiler transformations Optimizing Compiler and Parallelizing Compiler transformations applied at applied at Source-levelSource-level (Pre-synthesis) and during (Pre-synthesis) and during SchedulingScheduling Source-level code refinement using Source-level code refinement using Pre-synthesisPre-synthesis

transformationstransformations Code Restructuring by Code Restructuring by SpeculativeSpeculative Code Motions Code Motions Operation Operation replicationreplication to improve concurrency to improve concurrency DynamicDynamic transformations: exploit new opportunities during transformations: exploit new opportunities during

schedulingscheduling

Page 7: Center for Embedded Computer Systems University of California, Irvine and San Diego

7Copyright Sumit Gupta 2003

SPARSPARKK

High High Level Level

SynthesiSynthesis s

FramewoFrameworkrk

Page 8: Center for Embedded Computer Systems University of California, Irvine and San Diego

8Copyright Sumit Gupta 2003

SPARK Parallelizing HLS SPARK Parallelizing HLS FrameworkFramework

C input and C input and SynthesizableSynthesizable RTL VHDL output RTL VHDL output Tool-box Tool-box of Transformations and Heuristicsof Transformations and Heuristics

Each of these can be developed independently of the Each of these can be developed independently of the otherother

Script based Script based control over transformations & control over transformations & heuristicsheuristics

Hierarchical Intermediate Representation (HTGs)Hierarchical Intermediate Representation (HTGs) Retains structural information about design Retains structural information about design

(conditional blocks, loops)(conditional blocks, loops) Enables efficient and structured application of Enables efficient and structured application of

transformationstransformations Complete HLS tool:Complete HLS tool: Does Binding, Control Does Binding, Control

Synthesis and Backend VHDL generationSynthesis and Backend VHDL generation Interconnect Minimizing Resource BindingInterconnect Minimizing Resource Binding

Enables Enables Graphical VisualizationGraphical Visualization of Design of Design description and intermediate resultsdescription and intermediate results

100,000+ lines of C++ code100,000+ lines of C++ code

Page 9: Center for Embedded Computer Systems University of California, Irvine and San Diego

9Copyright Sumit Gupta 2003

Interface SynthesisInterface Synthesis ““Glue” logic to make the hardware and Glue” logic to make the hardware and

software talk to each othersoftware talk to each other Data must be transferred or shared between Data must be transferred or shared between

the hardware and softwarethe hardware and software Our approach to Interface Synthesis:Our approach to Interface Synthesis:

Map common data to a shared memoryMap common data to a shared memory Our focus is on optimizing the number and size of Our focus is on optimizing the number and size of

memory banks used for mappingmemory banks used for mapping Target architecture: Platform with FPGA and Target architecture: Platform with FPGA and

Processor core on same FPGA fabricProcessor core on same FPGA fabric Take advantage of the Memory Blocks available on Take advantage of the Memory Blocks available on

such FPGA platforms for implementing the shared such FPGA platforms for implementing the shared memorymemory

Page 10: Center for Embedded Computer Systems University of California, Irvine and San Diego

10Copyright Sumit Gupta 2003

The Interface HardwareThe Interface Hardware

Interface Logic

CPU CPU BusBus

ProcessProcessor Coreor Core

PeriphePeripheralral

PeriphePeripheralral

Custom Custom LogicLogic

Shared Memory

Main Main MemorMemor

yy

Interface Hardware

System architecture illustrating the System architecture illustrating the interface HWinterface HW

Data exchanged

between HW and SW

resides in the shared

memory

Page 11: Center for Embedded Computer Systems University of California, Irvine and San Diego

11Copyright Sumit Gupta 2003

Target HW Interface Target HW Interface ArchitectureArchitecture

Control

Logic

Memory Controll

er

Bus Interface

Appl. Kernel

Appl. Kernel

Mapped Local

Memory

FPGA FPGA FabricFabric

Pro

cess

or

Core

Pro

cess

or

Core

CPU CPU BusBus

Multiple Ports

Single Port

Page 12: Center for Embedded Computer Systems University of California, Irvine and San Diego

12Copyright Sumit Gupta 2003

Local Memory

An Example of Memory An Example of Memory MappingMapping

Inefficient use of the FPGA fabric by implementing

independent registers

Var NVar 2Var 1

FUFU 11 FUFU MM

N:2M MuxN:2M Mux

Var 1Var 2

FUFU 11 FUFU MM

K:2M MuxK:2M Mux

Var 3Mem 1Mem 1

Efficient use of FPGA resources by utilizing its abundant memory blocks

Var NMem 2Mem 2

Mem KMem K

Page 13: Center for Embedded Computer Systems University of California, Irvine and San Diego

13Copyright Sumit Gupta 2003

Mapping Designs with Mapping Designs with Control FlowControl Flow

Mapping variables to Mapping variables to memory banks is a memory banks is a well-known techniquewell-known technique

What’s new hereWhat’s new here: : Use Use ofof Scheduling info from Scheduling info from

High Level Synthesis High Level Synthesis tool tool

Mutual exclusivity info Mutual exclusivity info in designs with control in designs with control flowflow

Data access patterns Data access patterns resolved only during resolved only during run timerun time So how do we perform So how do we perform

memory mapping at memory mapping at compile time compile time ??

If NodeT F

BB 2BB 1

BB 3

a b

a c

b c

a b

C0

C1

Access Pattern: {(a, b), Access Pattern: {(a, b), (a, c)} or (a, c)} or {(a, b, c), {(a, b, c), (a, b)}(a, b)}

Condition resolved at

run time

a

Page 14: Center for Embedded Computer Systems University of California, Irvine and San Diego

14Copyright Sumit Gupta 2003

Our Approach to Memory Our Approach to Memory MappingMapping

We use We use Scheduling information from SPARK Synthesis Scheduling information from SPARK Synthesis

tooltool Cycles in which variables are accessed (read/write)Cycles in which variables are accessed (read/write)

Mutual exclusivity information from the Mutual exclusivity information from the control flow graphcontrol flow graph The conditions under which variables are accessedThe conditions under which variables are accessed

Capture this information using Multi-color Capture this information using Multi-color Conflict GraphsConflict Graphs

Then create a Variable to Memory Bank Then create a Variable to Memory Bank Mapping that minimizes the number of Mapping that minimizes the number of memory instancesmemory instances

Details available in ICCD 2003 paperDetails available in ICCD 2003 paper

Page 15: Center for Embedded Computer Systems University of California, Irvine and San Diego

15Copyright Sumit Gupta 2003

Experiments for SPARK Experiments for SPARK HLS ToolHLS Tool

Results presented here forResults presented here for Pre-synthesis transformationsPre-synthesis transformations Speculative Code MotionsSpeculative Code Motions Dynamic CSEDynamic CSE

We used We used SPARKSPARK to synthesize designs derived to synthesize designs derived from several industrial designsfrom several industrial designs MPEG-1, MPEG-2, GIMP Image Processing MPEG-1, MPEG-2, GIMP Image Processing

softwaresoftware Case StudyCase Study of Intel Instruction Length Decoder of Intel Instruction Length Decoder

Scheduling ResultsScheduling Results Number of States in Number of States in

FSMFSM Cycles on Longest Path Cycles on Longest Path

through Designthrough Design

VHDL: Logic VHDL: Logic Synthesis Synthesis Critical Path Length Critical Path Length

(ns)(ns) Unit AreaUnit Area

Page 16: Center for Embedded Computer Systems University of California, Irvine and San Diego

16Copyright Sumit Gupta 2003

Target ApplicationsTarget ApplicationsDesignDesign # of # of

IfsIfs# of # of

LoopsLoops# Non-# Non-Empty Empty Basic Basic BlocksBlocks

# of # of OperatiOperati

onsons

MPEG-1 MPEG-1 pred1pred1

44 22 1717 123123

MPEG-1 MPEG-1 pred2pred2

1111 66 4545 287287

MPEG-2 MPEG-2 dp_framdp_fram

ee

1818 44 6161 260260

GIMP GIMP

tilertiler1111 22 3535 150150

Page 17: Center for Embedded Computer Systems University of California, Irvine and San Diego

17Copyright Sumit Gupta 2003

MPEG-1 Pred1 Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

+ Speculative Code Motions

+ Pre-Synthesis Transforms

+ Dynamic CSE

MPEG-1 Pred2 Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

Scheduling & Logic Synthesis Scheduling & Logic Synthesis ResultsResults

Non-speculative CMs: Within BBs & Across Hier Blocks

42%

10%

36%

36%

8%

39%

Overall: 63-66 % improvement in DelayOverall: 63-66 % improvement in Delay

Almost constant Area Almost constant Area

Page 18: Center for Embedded Computer Systems University of California, Irvine and San Diego

18Copyright Sumit Gupta 2003

Non-speculative CMs: Within BBs & Across Hier Blocks

+ Speculative Code Motions

+ Pre-Synthesis Transforms

+ Dynamic CSE

Scheduling & Logic Synthesis Scheduling & Logic Synthesis ResultsResultsMPEG-2 DpFrame Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

GIMP Tiler Function

0

0.2

0.4

0.6

0.8

1

1.2

Longest Path(lcyc)

Critical Path(cns)

Total Delay (c*l) Unit Area

14%

20%1%

33%

41%

52%

Overall: 48-76 % improvement in DelayOverall: 48-76 % improvement in Delay

Almost constant Area Almost constant Area

Page 19: Center for Embedded Computer Systems University of California, Irvine and San Diego

19Copyright Sumit Gupta 2003

Interface Synthesis: Case Interface Synthesis: Case StudyStudy

Target Platform – Altera’s Nios development Target Platform – Altera’s Nios development board @33.33 MHzboard @33.33 MHz Soft core of the Nios embedded processor (no Soft core of the Nios embedded processor (no

caches)caches) Library of peripheral IP’s like timer, UART, SRAM Library of peripheral IP’s like timer, UART, SRAM

controllercontroller Altera FPGA with 8320 Logic CellsAltera FPGA with 8320 Logic Cells

ToolsTools The SPARK parallelizing high level synthesis The SPARK parallelizing high level synthesis

framework for high level synthesisframework for high level synthesis Leonardo Spectrum and Quartus II for logic Leonardo Spectrum and Quartus II for logic

synthesis and P&R respectivelysynthesis and P&R respectively The Nios specific GNU toolkit for compiling C codeThe Nios specific GNU toolkit for compiling C code

Page 20: Center for Embedded Computer Systems University of California, Irvine and San Diego

20Copyright Sumit Gupta 2003

Example: MPEG-1 Prediction Example: MPEG-1 Prediction BlockBlock

DesignDesign # of # of IfsIfs

# of # of LoopsLoops

# Non-# Non-Empty Empty Basic Basic BlocksBlocks

# of # of OperatiOperati

onsons

MPEG-1 MPEG-1 pred1pred1

44 22 1717 123123

MPEG-1 MPEG-1 pred2pred2

1111 66 4545 287287 MPEG-1 pred1 used MPEG-1 pred1 used 11 out of out of 33 kernels mapped kernels mapped

to HWto HW MPEG-2 pred2 used the remaining MPEG-2 pred2 used the remaining 22 kernels kernels

mapped to HWmapped to HW HW/SW interface consisted ofHW/SW interface consisted of 181 181 words of data words of data

Page 21: Center for Embedded Computer Systems University of California, Irvine and San Diego

21Copyright Sumit Gupta 2003

ConclusionsConclusions Parallelizing code transformations enable a new range Parallelizing code transformations enable a new range

of HLS transformationsof HLS transformations Provide the needed improvement in quality of HLS results Provide the needed improvement in quality of HLS results

Possible to be competitive against manually designed Possible to be competitive against manually designed circuits. circuits.

Can enable productivity improvements in microelectronic Can enable productivity improvements in microelectronic designdesign

Built a C-to-VHDL synthesis system with a range of code Built a C-to-VHDL synthesis system with a range of code transformationstransformations Platform for applying Coarse and Fine-grain OptimizationsPlatform for applying Coarse and Fine-grain Optimizations Tool-box approach where transformations and heuristics can Tool-box approach where transformations and heuristics can

be developedbe developed Script-based synthesis enables designer to control HLS processScript-based synthesis enables designer to control HLS process Performance improvements of 60-70 % across a number of Performance improvements of 60-70 % across a number of

designsdesigns Presented an interface synthesis approach targeted at Presented an interface synthesis approach targeted at

FPGA based platformsFPGA based platforms Based on a memory mapping algorithm that utilizes scheduling Based on a memory mapping algorithm that utilizes scheduling

information for synthesizing designs with control flowinformation for synthesizing designs with control flow

Page 22: Center for Embedded Computer Systems University of California, Irvine and San Diego

22Copyright Sumit Gupta 2003

SPARK ReleaseSPARK Release Available for downloadAvailable for download

http://www.cecs.uci.edu/~spark User ManualUser Manual

Running the toolRunning the tool Customizing the synthesis scriptsCustomizing the synthesis scripts

TutorialTutorial Synthesis of Portion of Motion Synthesis of Portion of Motion

Compensation algorithm in MPEG-1 Compensation algorithm in MPEG-1 player (along with logic synthesis scripts)player (along with logic synthesis scripts)

Page 23: Center for Embedded Computer Systems University of California, Irvine and San Diego

Thank YouThank You