Platform-Based Behavior-Level and System-Level Synthesis
Prof. Jason Cong
[email protected]
UCLA Computer Science Department
Outline
Motivation
xPilot system framework
Behavior-level synthesis in xPilot
  Advantages of behavioral synthesis
  Scheduling
  Resource binding
System-level synthesis in xPilot
  Synthesis for ASIP platforms
  Design exploration for heterogeneous MPSoCs
Conclusions
ASICs SoC Example: Philips Nexperia
Philips Nexperia SoC platform for high-end digital video
[Figure: block diagram of the Nexperia DVP system silicon, showing a MIPS CPU (PRxxxx) and a TriMedia CPU (TM-xxxx), each with D$/I$ caches and device IP blocks on a PI bus, plus SDRAM, the MMI, and the DVP memory bus.]
MIPS™: general-purpose scalable RISC processor
  50 to 300+ MHz
  32-bit or 64-bit
Library of device IP blocks
  Image coprocessors
  DSPs
  UART
  1394
  USB
  …
TriMedia™: scalable VLIW media processor
  100 to 300+ MHz
  32-bit or 64-bit
Nexperia™ system buses
  32-128 bit
Courtesy Philips
Field-Programmable SoC Example: Xilinx Virtex-4 FPGA
PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
MicroBlaze soft-core µProc: 180 MHz, ~1300 LUTs, 166 DMIPS
IBM CoreConnect™ bus
IP blocks
H.264/AVC hardware blocks
Courtesy Xilinx
Needs for Electronic System-Level (ESL) Design Automation
Need executable models for system-level specification
Need common specification for SW/HW co-design
Need better complexity management
ESL Landscape
Modeling
  SystemC: open source
  SystemVerilog
Simulation and Verification
  Behavior-level simulation & verification
  System-level simulation & verification
  SystemC provides behavior-level and system-level synthesis capabilities for free; rapidly gaining popularity
Synthesis
  Behavior-level synthesis: from behavior specification (e.g. C, SystemC, or Matlab) to RTL or netlists
  System-level synthesis: from system specification to system implementation
xPilot: Platform-Based Synthesis System
[Figure: xPilot system framework. Inputs: SystemC/C and platform description & constraints, passed through the xPilot front end into the SSDM (System-Level Synthesis Data Model). Inside xPilot: profiling, analysis, mapping, behavioral synthesis, processor & architecture synthesis, interface synthesis. Outputs for the embedded SoC: processor cores + executables, drivers + glue logic, custom logic.]
Uniqueness of xPilot
  Platform-based synthesis and optimization
  Communication-centric synthesis with interconnect optimization
xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: behavioral spec. in C/SystemC, platform description
Frontend compiler
SSDM
  Presynthesis optimizations
  • Loop unrolling/shifting
  • Strength reduction / tree height reduction
  • Bitwidth analysis
  • Memory analysis …
  Core synthesis optimizations
  • Scheduling
  • Resource binding, e.g., functional unit binding, register/port binding
  µArch-generation & RTL/constraints generation
  • Verilog/VHDL/SystemC
  • FPGAs: Altera, Xilinx
  • ASICs: Magma, Synopsys, …
RTL + constraints
FPGAs/ASICs
xPilot Advantages
Advanced algorithms for platform-based, communication-centric optimization
  E.g. a versatile scheduling engine based on solving a system of difference constraints (SDC)
Platform-based behavior and system synthesis
  E.g. resource binding based on distributed register architecture
Communication/interconnect-centric approach
  E.g. behavior and communication co-optimization
Complete validation through final P&R on FPGAs
Advanced Behavior System Algorithms
Example: Versatile Scheduling Algorithm Based on SDC
Scheduling problem in behavioral synthesis is NP-complete under general design constraints
ILP-based solutions are versatile but very inefficient
  Exponential time complexity
Our solution: an efficient and versatile scheduler based on SDC (system of difference constraints)
  Applicable to a broad spectrum of applications
  • Computation/data-intensive, control-intensive, memory-intensive, partially timed
  • Scalable to large-size designs (finishes in a few seconds)
  Amenable to a rich set of scheduling constraints:
  • Resource constraints, latency constraints, frequency constraints, relative IO timing constraints
  Capable of a variety of synthesis optimizations:
  • Operation chaining, pipelining, multi-cycle communication, incremental scheduling, etc.
[Figure: example schedule of operations +2, +3, +4, *1, *5 across control steps CS0 and CS1.]
Scheduling: Our Approach
Overall approach
  Current objective: high performance
  Use a system of integer difference constraints to express all kinds of scheduling constraints
  Represent the design objective in a linear function
[Figure: example DFG with operations v1 (+), v2 (*), v3 (*), v4 (−), v5 (+).]
Platform characterization:
  • adder (+/–): 2ns
  • multiplier (*): 5ns
Target cycle time: 10ns
Resource constraint: only ONE multiplier is available
Dependency constraint
  • v1 → v3: x3 – x1 ≥ 0
  • v2 → v3: x3 – x2 ≥ 0
  • v3 → v4: x4 – x3 ≥ 0
  • v4 → v5: x5 – x4 ≥ 0
Frequency constraint
  • <v2, v5>: x5 – x2 ≥ 1
Resource constraint
  • <v2, v3>: x3 – x2 ≥ 1
In matrix form, A x ≤ b (the second row merges the v2 → v3 dependency with the tighter resource constraint):
\[
\begin{bmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{bmatrix}
\]
Totally unimodular matrix: guarantees integral solutions
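A system of difference constraints such as this one can be solved with a longest-path relaxation in the style of Bellman-Ford. The sketch below is only an illustration under stated assumptions, not xPilot's implementation: the function name `solve_sdc` and the `(i, j, w)` tuple encoding are hypothetical, control steps are assumed to be non-negative integers, and the objective is simply the earliest feasible (ASAP) schedule rather than a general linear objective. The constraint list follows the five rows of the matrix.

```python
def solve_sdc(num_vars, constraints):
    """Solve x[j] - x[i] >= w for all (i, j, w), with every x >= 0.

    Variables are numbered from 1. Returns the smallest feasible
    assignment as a list [x1, ..., xn], or None if the system is
    infeasible (it contains a positive-weight cycle).
    """
    x = [0] * (num_vars + 1)          # x[0] is unused; all start at step 0
    for _ in range(num_vars):
        changed = False
        for i, j, w in constraints:
            if x[i] + w > x[j]:       # constraint violated: push x[j] later
                x[j] = x[i] + w
                changed = True
        if not changed:
            return x[1:]
    return None                       # still relaxing: positive cycle


# Rows of A x <= b, rewritten as x[j] - x[i] >= w.
constraints = [
    (1, 3, 0),  # v1 -> v3 dependency
    (2, 3, 1),  # v2 -> v3 dependency merged with the one-multiplier constraint
    (3, 4, 0),  # v3 -> v4 dependency
    (4, 5, 0),  # v4 -> v5 dependency
    (2, 5, 1),  # frequency constraint between v2 and v5
]
print(solve_sdc(5, constraints))  # [0, 0, 1, 1, 1]
```

Because every constraint involves the difference of two variables with an integer bound, the relaxation only ever produces integer values, which mirrors the integrality guarantee noted for the totally unimodular matrix.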
Platform Modeling & Characterization
Target platform specification
  High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations
  • Functional units: adders, ALUs, multipliers, comparators, etc.
  • Connectors: mux, demux, etc.
  • Memories: registers, synchronous memories, etc.
  Chip layout description
  • On-chip resource distributions
  • On-chip interconnect delay/power estimation
3X3 Delay Matrix for Stratix-EP1S40
  4.7   3.8   2.8
  3.7   2.9   2.0
  2.8   1.8   0.58
Two binding solutions for same behavior: which one is better?
[Figure: two binding solutions, one sharing a single ALU through a MUX and one using two ALUs.]
Answer is platform-dependent:
  How large/fast are the MUX and ALU?
Communication- and Interconnect-Centric Synthesis
Example: Use of Distributed Register-File Architectures
A scheduled DFG with register binding indicated on each variable (assume one-functional-unit constraint)
[Figure: scheduled DFG with variables bound to registers 1 to 4.]
Binding using discrete registers
Binding using a register file: more efficient design!
Distributed register-file micro-architecture:
  Efficiently use on-chip embedded memories
  Fully explore operation and data-transfer parallelism
[Figure: distributed register-file micro-architecture with Islands A, B, and C; an island contains input buffers, data-routing logic, local register files, an FUP MUX, and a functional unit pool (MUL, ALU, ALU').]
Distributed Register-File Microarchitecture
[Figure: Islands A, B, and C, each with input buffers, data-routing logic, local register files, an FUP MUX, and a functional unit pool (MUL, ALU, ALU'), mapped onto an FP-SoC next to on-chip memory blocks.]
On-chip RAM resource on Virtex II
  Xilinx XC-2V      2000   3000   4000   6000   8000
  #18Kb BRAM          56     96    120    144    168
  Dist. RAM (Kb)     336    448    720  1,056  1,456
Resource Binding for DRF-Microarchitecture
Facts under simplified assumptions
  Operations bound onto an island form a chain in the given scheduled DFG
  Inter-chain data transfers may share a physical inter-island connection
The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance
[Figure: scheduled DFG with operations v1 to v10 over control steps 1 to 4, partitioned into island chains A, B, C, and D, with intra-island and inter-island transfers marked.]
Inter-island connections = 5
  (A,B) = (A,D) = 1
  (A,C) = 1, two data transfers share one connection
  (C,D) = 2
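The inter-island connection count can be illustrated with a small sketch. Everything here is an assumption made for illustration: the slide does not state the exact sharing rule, so the code assumes that two transfers between the same pair of islands can share one physical connection only when they occur in different control steps, and that connections are undirected. The transfer list is hypothetical; it is chosen so that the totals agree with the counts quoted on the slide, not read off the figure.

```python
from collections import defaultdict


def count_inter_island_connections(transfers):
    """Return {island_pair: connections} for (src, dst, control_step) transfers.

    Assumed rule: per island pair, the number of connections equals the
    largest number of transfers that happen in the same control step.
    """
    per_pair_step = defaultdict(int)
    for src, dst, step in transfers:
        if src == dst:
            continue  # intra-island transfer: stays in the local register file
        pair = tuple(sorted((src, dst)))
        per_pair_step[(pair, step)] += 1

    connections = defaultdict(int)
    for (pair, _step), count in per_pair_step.items():
        connections[pair] = max(connections[pair], count)
    return dict(connections)


# Hypothetical transfers: (source island, destination island, control step).
transfers = [
    ("A", "B", 1),
    ("A", "D", 2),
    ("A", "C", 1), ("A", "C", 3),   # different steps: share one connection
    ("C", "D", 2), ("C", "D", 2),   # same step: need two connections
    ("B", "B", 2),                  # intra-island, not counted
]
iic = count_inter_island_connections(transfers)
print(iic, sum(iic.values()))
```

Under this assumed rule a binding algorithm would try to place concurrent transfers inside one island and spread the remaining inter-island transfers across control steps.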
Example: Behavior and Communication Co-Optimization in Platform-Based Interface Synthesis
Focus on sequential communication media (SCM)
  FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
  Order may have dramatic impact on performance
  • Best order should guarantee that no data transmissions on the critical path are delayed by non-critical transmissions
Interface synthesis for SCM
  Consider both behavior and communication to determine the optimal transmission order
DCT example
  Custom logic 1 (PE1, process P1):
    for (int i = 0; i < 8; i++) {
    S1: data[i] = …;
    }
  Custom logic 2 (PE2, process P2):
    int s07 = data[0] + data[7];
    int s16 = data[1] + data[6];
    …
[Figure: P1 on PE1 sends data[8] to P2 on PE2 over channel C, implemented as a FIFO.]
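The effect of transmission order in the DCT example can be sketched with a toy timing model. The model is an assumption, not the SCOOP algorithm: the FIFO delivers one element per cycle in the order sent, the consumer performs its additions one at a time with a latency of one cycle each, and the function name `consumer_finish_time` is hypothetical.

```python
def consumer_finish_time(send_order, pairs, add_latency=1):
    """Cycle at which the consumer finishes all additions.

    send_order: indices of data[] in the order the producer sends them.
    pairs: (i, j) index pairs the consumer adds, in program order.
    """
    # Element sent at position t (0-based) arrives at cycle t + 1.
    arrival = {idx: t + 1 for t, idx in enumerate(send_order)}
    time = 0
    for i, j in pairs:
        ready = max(arrival[i], arrival[j])    # both operands must be present
        time = max(time, ready) + add_latency  # additions run back to back
    return time


pairs = [(0, 7), (1, 6), (2, 5), (3, 4)]       # s07, s16, s25, s34
natural = list(range(8))                       # data[0], data[1], ..., data[7]
reordered = [0, 7, 1, 6, 2, 5, 3, 4]           # send operands pair by pair

print(consumer_finish_time(natural, pairs))    # 12
print(consumer_finish_time(reordered, pairs))  # 9
```

In the natural order the first addition waits for data[7], the last element sent, so every later addition is delayed; sending the operands pair by pair lets the consumer overlap computation with communication.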
Proposed SCM Co-Optimization Design Flow
Inputs: process network, platform description & constraints
Front end
System-Level Synthesis Data Model
SCOOP (SCM Co-Optimization), driven by process behavior
  Communication order detection
  Indices compression for loop reordering
Code transformation and interface generation
Drivers + glue logic
Initial Results of Interface Synthesis
Target for sequential communication channels
  In particular, FSL in Virtex-II
Consider two communicating processes

            Total latency (Cycle#)          RAs Compress
  Designs   Trad.   SCOOP   Reduction       After Before
  Dot       1903    1084    43.04%          0 300
  Masking   620     420     32.26%          0 192
  DCT2      483     419     13.25%          6480
  Mat_mul   408     339     16.91%          22096
  DWT       689     617     10.45%          0 0
  Haar      142     134     5.63%           0 0
  DCT1      325     290     10.77%          0 0

An average of 26% improvement in total latency can be achieved.
SystemC/C-to-RTL Design Flow
Inputs: SystemC/C specification, platform description & constraints
Front-end compiler
SSDM (System-Level Synthesis Data Model): SSDM/CDFG
xPilot behavioral synthesis
  Behavioral synthesis
  SSDM/FSMD
  RTL generation
FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints
RTL synthesis
ASICs/FPGAs platform
Preliminary Results of xPilot: Shorter Simulation/Verification Cycle
From other projects:
  Simulation speed on behavior model 100X faster than RTL-based method [NEC, ASPDAC04]
Our experience:
  Motion-compensation module in an MPEG-4 decoder
  • Behavior level (in C language) simulation: less than 1 second per frame
  • RTL SystemC simulation: about 310 seconds per frame
Preliminary Results of xPilot: Better Complexity Management
Significant code size reduction
  RTL design → behavioral design: 10x code size reduction
VHDL code generated by UCLA xPilot targeting Altera Stratix platform
Preliminary Results of xPilot: Rapid System Exploration
Quick evaluation of different hardware/software boundaries
Example: Motion-JPEG implementation
  - All HW implementation
  - All SW implementation (using embedded processors)
  - SW/HW co-design: optimal partitioning?
  - Repeated manual RTL coding is not a solution!
Preliminary Results on Motion-JPEG Example
[Figure: Motion-JPEG pipeline on a Xilinx XUP board, from RAW images to encoded JPEG images, with stages Preprocess, DCT, Quant, Huffman, and Table Modification; in the second variant the DCT stage is replaced by HW-DCT.]
Model #1: 5 MicroBlazes, FSL-based communication
Model #2: 4 MicroBlazes + DCT on FPGA fabric

  System     Area (Slice#)   Cycle#          Fmax (MHz)   Exe Time (ms)
  Model #1   4306            23812           126          0.189
  Model #2   6345            14800 (-38%)    126          0.117
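The execution times reported for the two Motion-JPEG models follow from the cycle counts and the clock frequency. The short sketch below reproduces that arithmetic; the helper name `exe_time_ms` is hypothetical, and the only inputs are the Cycle# and Fmax figures from the slide.

```python
def exe_time_ms(cycles, fmax_mhz):
    """Execution time in milliseconds for a cycle count at fmax_mhz."""
    # cycles / (fmax_mhz * 1e6 Hz) seconds, multiplied by 1e3 for ms
    return cycles / (fmax_mhz * 1e3)


model1 = exe_time_ms(23812, 126)   # 5 MicroBlazes
model2 = exe_time_ms(14800, 126)   # 4 MicroBlazes + hardware DCT
cycle_change = (14800 - 23812) / 23812 * 100

print(round(model1, 3), round(model2, 3), round(cycle_change))
```

Since both models run at the same Fmax, the 38% cycle reduction translates directly into the shorter execution time, at the price of a larger slice count.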
Preliminary Results of xPilot: Better QoR (Comparison with UCI/UCSD SPARK)

              SPARK                                           xPilot                                          Delay Ratio
              Resource Usage                     Fmax         Resource Usage                     Fmax         xPilot/SPARK
  Designs     Slice  Slice(LUT)  Slice(FF)  DSP  (MHz)        Slice  Slice(LUT)  Slice(FF)  DSP  (MHz)
  PR            588     981        247       0    92.85         331     416        564      16   146.84       1.58
  WANG          660    1157        265       0   109.29         357     464        588      15   133.51       1.22
  LEE           574     996        220       0   109.17         356     484        659      19   131.93       1.21
  MCM          1062    1857        479       0    99.40         887    1207       1282      30   110.38       1.11
  DIR          1323    2256        494       3    79.30         979    1002       1732      56    98.81       1.25
  Ave Ratio       1       1          1       1     1.00        0.66    0.48       2.74     n/a     1.27       1.27

Device setting: Xilinx Virtex-II Pro (xc2v4000 -6)
Target frequency: 200 MHz
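The "Ave Ratio" row can be cross-checked from the per-design figures. The sketch below does this for the slice counts and Fmax values only; the dictionary layout and variable names are hypothetical, and the averaging method (arithmetic mean of the per-design xPilot/SPARK ratios) is an assumption that happens to match the reported 0.66 and 1.27.

```python
# (slices, fmax_mhz) per design, taken from the comparison table.
spark = {"PR": (588, 92.85), "WANG": (660, 109.29), "LEE": (574, 109.17),
         "MCM": (1062, 99.40), "DIR": (1323, 79.30)}
xpilot = {"PR": (331, 146.84), "WANG": (357, 133.51), "LEE": (356, 131.93),
          "MCM": (887, 110.38), "DIR": (979, 98.81)}


def mean_ratio(column):
    """Arithmetic mean of xPilot/SPARK ratios for one column index."""
    ratios = [xpilot[d][column] / spark[d][column] for d in spark]
    return sum(ratios) / len(ratios)


avg_slice_ratio = mean_ratio(0)
avg_fmax_ratio = mean_ratio(1)
print(round(avg_slice_ratio, 2), round(avg_fmax_ratio, 2))  # 0.66 1.27
```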
Design Exploration for Heterogeneous MPSoC Platforms
Heterogeneous MPSoCs exploration
  Processors
  • Heterogeneous vs. homogeneous
  • General-purpose vs. application-specific
  On-chip communication architecture (OCA)
  • Bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
  Memory hierarchy
[Figure: heterogeneous MPSoC with µPs (each running tasks, OS, and drivers), a DSP, an IP block, and an FPGA, each attached through a network interface to a communication network.]
Configurable SoC Platforms
General purpose processor cores + programmable fabric
  Tight integration using extended instructions (ASIPs)
  • Example: Altera Nios / Nios II
  Loose integration using FIFOs/busses for communications
  • Example: Xilinx MicroBlaze, etc.
[Figure: custom instruction logic for Nios II. Source: www.altera.com]
[Figure: Xilinx MicroBlaze. Source: www.xilinx.com]
ASIP Compilation: Problem Statement
Given:
  CDFG G(V, E)
  The basic instruction set I
  Pattern constraints:
  • Number of inputs: $|PI(p_i)| \le N_{in}$;
  • Number of outputs: $|PO(p_i)| = 1$;
  • Total area: $\sum_{1 \le i \le N} \mathrm{area}(p_i) < A$
Objective:
  Generate a pattern library P
  Map G to the extended instruction set I ∪ P, so that the total execution time is minimized
Example (* takes 2 clock cycles, + takes 1 clock cycle):
  t1 = a * b;
  t2 = b * c;
  t3 = d * e;
  t4 = t1 + t2;
  t5 = t2 + t3;
  t6 = t5 + t4;
[Figure: DFG of the example with inputs a, b, c, d, e; the sub-graphs producing t4 and t5 are covered by ext-inst1 (MAC1: 2 cycles) and ext-inst2 (MAC2: 2 cycles).]
After mapping to the extended instructions:
  t4 = ext-inst1(a, b, c);
  t5 = ext-inst2(b, c, d, e);
  t6 = t4 + t5;
Performance speedup = 9 / 5 = 1.8X
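The 1.8X figure can be reproduced by counting cycles for both instruction sequences. The sketch below assumes strictly sequential, single-issue execution in which cycle counts simply add up; the function names and the `"mac"` label for the two extended instructions are hypothetical, while the per-operation latencies come from the slide.

```python
# Latencies from the slide: * takes 2 cycles, + takes 1, each MAC takes 2.
CYCLES = {"mul": 2, "add": 1, "mac": 2}


def run_baseline(a, b, c, d, e):
    """Original code on the basic instruction set."""
    t1 = a * b
    t2 = b * c
    t3 = d * e
    t4 = t1 + t2
    t5 = t2 + t3
    t6 = t5 + t4
    return t6, ["mul", "mul", "mul", "add", "add", "add"]


def run_extended(a, b, c, d, e):
    """Same computation using the two extended MAC instructions."""
    t4 = a * b + b * c   # ext-inst1(a, b, c)
    t5 = b * c + d * e   # ext-inst2(b, c, d, e)
    t6 = t4 + t5
    return t6, ["mac", "mac", "add"]


def total_cycles(ops):
    return sum(CYCLES[op] for op in ops)


base_value, base_ops = run_baseline(1, 2, 3, 4, 5)
ext_value, ext_ops = run_extended(1, 2, 3, 4, 5)
speedup = total_cycles(base_ops) / total_cycles(ext_ops)
print(total_cycles(base_ops), total_cycles(ext_ops), speedup)  # 9 5 1.8
```

Note that ext-inst2 recomputes b * c instead of reusing t2; the mapping trades one redundant multiplication for fewer issued instructions.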
Target Core Processor Model
[Figure: core processor pipeline with PC, instruction cache, register file (RS1, RS2), ALU (OP1, OP2), memory, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; custom logic sits alongside the ALU and shares its operands and result path.]
Core processor model
  Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)
  • The number of input and output operands of an instruction is pre-determined
  • An instruction reads the core register file during the execute stage, and commits the result during the write-back stage
ASIP Compilation Flow
Inputs: C code, µArch constraint
Front-end compilation, producing the CDFG
1. Pattern generation
   Satisfying input/output constraints
2. Pattern selection
   Select a subset to maximize the potential speedup while satisfying the resource constraint
   Produces the pattern library
3. Application mapping & graph covering
   Graph covering to minimize the total execution time
   Produces the optimized CDFG
Backend compilation, producing optimized assembly
Experimental Results on Altera Nios
Altera Nios is used for ASIP implementation
  5 extended instruction formats
  up to 2048 instructions for each format
Small DSP applications are taken as benchmark

              Extended        Speedup              Resource Overhead
              Instruction#    Estimation   Nios    LE             Memory            DSP Block
  fft_br      9               3.28         2.65    408 (6.06%)    65,536 (9.79%)    16
  iir         7               3.18         3.73    255 (3.79%)    4,736 (0.71%)     40
  fir         2               2.40         2.14    51 (0.76%)     1,024 (0.15%)     8
  pr          2               1.57         1.75    71 (1.05%)     0 (0.00%)         14
  dir         2               3.28         3.02    54 (0.80%)     0 (0.00%)         16
  mcm         4               4.75         3.22    186 (2.76%)    0 (0.00%)         56
  Average     -               3.08         2.75    - (2.54%)      - (1.77%)         -
Architecture Extension for ASIPs
Data bandwidth problem
  • Limited register file bandwidth (two read ports, one write port)
  • ~40% of the ideal performance speedup will be lost
Shadow-register-based architectural extension
  Core registers are augmented by an extra set of shadow registers
  • Conditionally written during write-back stage
  • Low power/area overhead
  Novel shadow-register binding algorithms are developed
[Figure: the core processor pipeline extended with shadow registers SR1 … SRK feeding the custom logic, and hashing units that select the shadow register, k = hash(j).]
Ongoing Work: Mapping for Heterogeneous Integration with Multiple Processing Cores
Given:
  A library of processing cores P and communication library C
  Task graph G(V, E)
  • For each v in V, execution time t(v, p_i) on p_i
  • For each (u, v) in E, communication data size s(u, v)
  Throughput constraint
Problem:
  Select and instantiate the processing elements and communication channels from P and C respectively
  Map the tasks onto the processing elements and communications to the channels so that
  • The optimal latency is achieved subject to the throughput constraint
  • The implementation cost is minimized
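A heavily simplified version of this mapping problem can be explored by exhaustive enumeration. The sketch below is a toy under stated assumptions, not the method being developed: it ignores the communication library and data sizes, allows one instance of each processing element, treats the throughput constraint as an upper bound on the busiest element's load per period, takes latency as the sum of task execution times, and ranks solutions by cost first and latency second. The task names, execution times, and costs are all hypothetical.

```python
from itertools import product


def explore(tasks, pes, exec_time, pe_cost, period):
    """Return ((cost, latency), mapping) of the best feasible mapping, or None."""
    best = None
    for assignment in product(pes, repeat=len(tasks)):
        load = {}
        for task, pe in zip(tasks, assignment):
            load[pe] = load.get(pe, 0) + exec_time[task][pe]
        if max(load.values()) > period:
            continue                              # throughput constraint violated
        cost = sum(pe_cost[pe] for pe in load)    # pay only for instantiated PEs
        latency = sum(exec_time[t][p] for t, p in zip(tasks, assignment))
        key = (cost, latency)
        if best is None or key < best[0]:
            best = (key, dict(zip(tasks, assignment)))
    return best


# Hypothetical data: execution time t(v, p) per task and processing element.
tasks = ["vld", "idct", "mc"]
pes = ["cpu", "hw"]
exec_time = {"vld": {"cpu": 6, "hw": 4},
             "idct": {"cpu": 5, "hw": 1},
             "mc": {"cpu": 4, "hw": 2}}
pe_cost = {"cpu": 1, "hw": 3}

print(explore(tasks, pes, exec_time, pe_cost, period=6))
```

Enumeration grows exponentially with the number of tasks, so it only serves to make the problem statement concrete; a practical flow would need a heuristic or an exact solver.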
MPEG-4 Simple Profile Decoder: Architecture Profiling
• C specification overview
  Module Name          Orig. C Source File
  Texture Update       textureUpdate.c
  Texture/IDCT         texture_idct.c
  Parser/VLD           texture_vld.c, parser.c
  Motion Comp.         Motion-Compensation.c
  Display Controller   displayControl.c
  Copy Controller      copyControl.c
  Orig. C line #: 220, 1901, 508, 1092, 312, 358, 287
• Runtime profiling (PowerPC/XUP board)
  Parser/VLD           59.0%
  Texture/IDCT         18.1%
  Motion Comp.         15.7%
  Copy Controller      3.6%
MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation
Software blocks running on PowerPC
HW block integrated with PowerPC single-process design:
  15% speed improvement
MPEG-4 Simple Profile Decoder: Alternate Implementations
Implementations: Single PowerPC; 7-uBlaze; Single PowerPC w/ HW Motion Comp.; Single uBlaze
Throughput (frames per second): 3.53, 3.06, 1.18, 0.59
Improvement: +15.3%, +68.4%, +209%, -
• xPilot synthesis report of HW blocks

                   Line counts                      Slices                       Clock         Latency
                   C     RTL SystemC   RTL VHDL     (FFs, LUTs)           MUL    period (ns)   (Cycles)
  Motion Comp.     210   9903          5655         986 (1111, 1017)      2      7.97          505
  Block IDCT       200   9534          2731         1877 (2376, 2438)     26     7.963         280
  Texture Update   160   8227          4475         1551 (1696, 1931)     4      7.913         335
Conclusions
xPilot has fairly mature and advanced behavior synthesis capability from C or SystemC to RTL code with necessary design constraints
xPilot advantages include
  Platform-based behavior and system synthesis
  Communication/interconnect-centric approach
  Advanced algorithms for platform-based, communication-centric optimization
  Promising results demonstrated on available FPGAs
xPilot system synthesis capabilities
  Performance simulation of multi-processor systems
  Exploration of the efficient use of (multiple) on-chip processors
  Compilation and optimization for reconfigurable processors
Acknowledgements
We would like to acknowledge support from
  Gigascale Systems Research Center (GSRC)
  National Science Foundation (NSF)
  Semiconductor Research Corporation (SRC)
  Industrial sponsors under the California MICRO programs (Altera, Xilinx)
Team members:
  Yiping Fan
  Zhiru Zhang
  Wei Jiang
  Guoling Han