HW/SW Co-design System Partitioning in HW/SW Co-Design
-
*Outline
HW/SW Codesign for Embedded Systems
System Partitioning
Heterogeneous Platforms
Mapping Graphs to Platforms
Heuristic Optimisation Methods for Multiple Objectives
Summary
-
*Embedded System Design
An embedded system is a computing device dedicated to a specific purpose. Its implementation is predominantly determined by this purpose, which usually entails complete encapsulation in the environment where the purpose is located.
Examples:
Automotive
Phones/PDAs
Transceivers (WiFi, WLAN, xDSL, ...)
-
*Embedded System Design Flow
-
*Outline
Embedded System Design
System Partitioning
Heterogeneous Platforms
Mapping Graphs to Platforms
Heuristic Optimisation for Multiple Objectives
Summary
-
*Heterogeneous Platforms
Classical HW/SW codesign platform
Has been around for ~20 years
Served well to get a first grip on partitioning
Has not gained any relevance for industrial design flows
-
*Heterogeneous Platforms
Modern rapid prototyping platforms
Prototyping board for real-time MIMO OFDM
DSP + microcontroller
FPGAs
Buses and bridges
RAM and registers
Interfaces
-
*Heterogeneous Platforms
Modern SoC/embedded platforms
UMTS baseband transceiver chip (2003)
DSP + microcontroller
ASICs
Buses and bridges
RAM and registers
Interfaces
-
*Heterogeneous Platforms
Library for
DSPs: cache/RAM, schedules
FPGA: RAM/flash, slices/gates
ASICs: registers/gates
Channels: FIFO/direct/bus
Memory: schedules, parallel read/write access
-
*Mapping Graphs to Platforms
System Graphs
[Figure: system graph of a baseband receiver. Blocks include: ADC; cell searcher (16x MF, EnAcc, PeakDet, PSCH, SSCH, GroupCode); delay profile estimator (searchers, PathSelect); rake receiver (finger select, fingers 1 to N with de-spread pilot, de-spread data, frequency offset estimation, weight gain, TPC/FBI); demodulator (4-QAM demod, deinterleavers 1 and 2, desegmentation, Turbo decoder, Viterbi decoder, switch); logical data processing (CRC); TCU]
-
*Mapping Graphs to Platforms
[Figure: process graph with vertices v1 to v6]
-
*Mapping Graphs to Platforms
NP-hard multi-objective optimisation problem
Proven to be NP-complete by restriction to the classical graph partitioning problem
[Figure: weighted example graph for the graph partitioning problem, shown twice]
-
*Heuristic Optimisation
Multi-objective optimisation problem
A mapping of a problem instance I is called valid iff every objective function satisfies its constraint.
The mapping relation assigns a vertex i to the jth implementation alternative A on resource r.
Objective functions:
Area for HW in gates/slices/NAND2 equivalents: accumulated over the vertices mapped to ASICs and to FPGAs.
Code size for SW in bytes: accumulated over the vertices mapped to DSPs.
...
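As a rough illustration of the area and code-size objectives, the following Python sketch sums the gate count of everything mapped to hardware and the code size of everything mapped to software, then checks both against constraints. The `Alternative` record, the resource names, and the limits are assumptions made for this example; the (cs, et, gc) fields follow the tuple notation this deck uses for implementation alternatives.

```python
from dataclasses import dataclass

HW_RESOURCES = {"ASIC", "FPGA"}  # assumed hardware resource names
SW_RESOURCES = {"DSP"}           # assumed software resource names


@dataclass(frozen=True)
class Alternative:
    """One implementation alternative of a vertex (hypothetical record)."""
    resource: str
    cs: int  # code size in bytes
    et: int  # execution time
    gc: int  # gate count


def area(mapping):
    """Total gate count of all vertices mapped to hardware."""
    return sum(a.gc for a in mapping.values() if a.resource in HW_RESOURCES)


def code_size(mapping):
    """Total code size of all vertices mapped to software."""
    return sum(a.cs for a in mapping.values() if a.resource in SW_RESOURCES)


def is_valid(mapping, max_area, max_code_size):
    """A mapping is valid if every objective stays within its constraint."""
    return area(mapping) <= max_area and code_size(mapping) <= max_code_size


# Toy mapping of two vertices; the limits below are made up.
mapping = {
    "v1": Alternative("DSP", cs=50, et=256, gc=0),
    "v2": Alternative("ASIC", cs=0, et=92, gc=880),
}
print(area(mapping), code_size(mapping), is_valid(mapping, 1000, 64))
```

The system delay is left out here on purpose, because it depends on the schedule and not only on per-vertex sums.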
-
*Heuristic Optimisation
Objective function fT: system delay (makespan)
Multi-core scheduling is NP-hard as well
[Figure: process graph (v1 to v6) mapped onto a platform with DSP, FPGA, ASIC, shared RAM, and SDRAM, connected by a bus, a direct ASIC-FPGA channel, and an ASIC-DSP FIFO; the resulting schedules show read, write, and processing phases]
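To make the makespan objective fT concrete, here is a minimal list-scheduling sketch in Python. It is not the scheduler used in this work: it assumes that every resource runs one process at a time, that communication over channels costs nothing, and that the processes are handed over in a topological order. All names (`makespan`, `preds`, `resource_of`) are hypothetical.

```python
def makespan(order, preds, exec_time, resource_of):
    """Schedule processes in the given topological order and return the delay.

    order       -- list of vertices, predecessors before successors
    preds       -- dict: vertex -> iterable of predecessor vertices
    exec_time   -- dict: vertex -> execution time on its mapped resource
    resource_of -- dict: vertex -> name of the resource it is mapped to
    """
    finish = {}   # finish time of each scheduled vertex
    free_at = {}  # time at which each resource becomes idle again
    for v in order:
        # A process may start once all predecessors have finished ...
        ready = max((finish[p] for p in preds.get(v, ())), default=0)
        # ... and once its resource is idle.
        start = max(ready, free_at.get(resource_of[v], 0))
        finish[v] = start + exec_time[v]
        free_at[resource_of[v]] = finish[v]
    return max(finish.values(), default=0)


# Diamond-shaped toy graph: v1 -> v2 -> v4 and v1 -> v3 -> v4.
preds = {"v2": ["v1"], "v3": ["v1"], "v4": ["v2", "v3"]}
exec_time = {"v1": 2, "v2": 3, "v3": 4, "v4": 1}
order = ["v1", "v2", "v3", "v4"]

all_on_dsp = {v: "DSP" for v in order}
v3_on_fpga = {**all_on_dsp, "v3": "FPGA"}
print(makespan(order, preds, exec_time, all_on_dsp))
print(makespan(order, preds, exec_time, v3_on_fpga))
```

Moving v3 to a second resource lets it overlap with v2, which is exactly the structural effect that per-vertex cost sums cannot capture.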
-
*Heuristic Optimisation
Definition: A heuristic is a robust technique for the design of (randomised) algorithms for optimisation problems. It provides (randomised) algorithms for which one cannot guarantee both the efficiency and the quality of the computed feasible solutions at once, not even with any bounded constant probability P > 0.
-
*Heuristic Optimisation
Partitioning is not analytically solvable
Use heuristic methods
Simulated Annealing
Tabu Search
Kernighan-Lin min-cut
Genetic Algorithm
Particle Swarm
Custom heuristics (GCLP, RRES, etc.)
...
-
*Heuristic Optimisation
Classical Kernighan-Lin min-cut
Modifications
More than two partitions
Unbalanced partitions allowed
Multiple objectives
Omit change list
...
[Figure: example graph with vertices A to G divided into partitions]
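The core step of the classical Kernighan-Lin min-cut can be sketched in Python as follows. The sketch covers only the textbook two-partition case with a single swap; the modifications listed on the slide (more partitions, unbalanced partitions, multiple objectives) are not modelled. The function names and the toy graph are assumptions for illustration.

```python
def build_adjacency(edges):
    """Turn {(u, v): weight} into a symmetric adjacency dict."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def cut_weight(edges, part):
    """Sum of the weights of all edges crossing the partition boundary."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])


def d_value(adj, part, v):
    """External minus internal edge weight of vertex v."""
    external = sum(w for n, w in adj[v].items() if part[n] != part[v])
    internal = sum(w for n, w in adj[v].items() if part[n] == part[v])
    return external - internal


def swap_gain(adj, part, a, b):
    """Cut reduction obtained by swapping a and b (in different partitions)."""
    return d_value(adj, part, a) + d_value(adj, part, b) - 2 * adj[a].get(b, 0)


def best_swap(adj, part):
    """Pair (a in partition 0, b in partition 1) with the largest gain."""
    pairs = [(a, b) for a in part for b in part
             if part[a] == 0 and part[b] == 1]
    return max(pairs, key=lambda p: swap_gain(adj, part, *p))


edges = {("A", "B"): 1, ("A", "C"): 3, ("B", "D"): 3, ("C", "D"): 1}
part = {"A": 0, "B": 0, "C": 1, "D": 1}
adj = build_adjacency(edges)

a, b = best_swap(adj, part)
gain = swap_gain(adj, part, a, b)
before = cut_weight(edges, part)
part[a], part[b] = part[b], part[a]  # apply the swap
print(before, gain, cut_weight(edges, part))
```

A full Kernighan-Lin pass would lock the swapped pair, repeat until all vertices are locked, and then keep the prefix of swaps with the best cumulative gain.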
-
*Summary
Scheduling/partitioning is a hard optimisation problem
Heuristic methods have to be applied
Highly dependent on platform model and high-level estimation techniques
Many questions are still unsolved
Execution time profiles for processes (control flow)
Estimation uncertainties
Automated platform composition
...
-
*Outline
Thank you for your attention
-
Typical Graphs
Industry design for xDSL transceiver
[Figure: signal-flow graph of the xDSL transceiver with rx, tx, im, and th paths (block names prefixed ac_rx, ac_tx, ac_im, ac_th), built from FIR filters, WDF stages, gain and trim stages, and sample-rate conversions between 8k, 16k, 32k, 256k, and 16M]
Additional logic may be needed to reprogram the IM filter for Flexi Slic
Additional logic may be needed to reprogram the FIR filter for Flexi Slic
-
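Graph Properties
Degree of parallelism = |V| / |VCP|
Density = |E| / |V|
Rank-locality rloc = (1 / |E|) * sum over all edges of (rank(vhead) - rank(vtail))
[Figure: example graph with 16 vertices and 22 edges, arranged by rank 0 to 7]
Degree of parallelism = 16 / 8 = 2
Density = 22 / 16 = 1.375
rloc = 27 / 22 = 1.227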
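The three graph properties can be computed with a few lines of Python. This sketch assumes that the rank of a vertex is the length of the longest path from any source to it, and that |VCP| is the number of vertices on the longest path (maximum rank plus one), which matches the example values on the slide. The function names are hypothetical.

```python
from graphlib import TopologicalSorter


def ranks(vertices, edges):
    """Rank = longest path length (in edges) from a source vertex."""
    preds = {v: set() for v in vertices}
    for tail, head in edges:
        preds[head].add(tail)
    rank = {}
    for v in TopologicalSorter(preds).static_order():
        rank[v] = max((rank[p] + 1 for p in preds[v]), default=0)
    return rank


def graph_properties(vertices, edges):
    """Degree of parallelism, density, and rank-locality of a DAG."""
    rank = ranks(vertices, edges)
    critical_path_vertices = max(rank.values()) + 1  # assumed meaning of |VCP|
    return {
        "parallelism": len(vertices) / critical_path_vertices,
        "density": len(edges) / len(vertices),
        "rloc": sum(rank[h] - rank[t] for t, h in edges) / len(edges),
    }


# Small toy DAG; edges are (tail, head) pairs.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")]
print(graph_properties(vertices, edges))
```

An rloc close to 1 means that most edges connect neighbouring ranks; the shortcut edge from a to d is what pushes the toy value above 1.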
-
Restricted Range Exhaustive Search
Create task graph
Create ordered vector of processes
Create initial mapping
Start exhaustive search on subset of processes (window)
Move window along the vector
Finally map the process that leaves the window
Strong performance for typical graphs (degree of parallelism, density, locality)
[Figure: task graph with vertices a to l and the corresponding vertex vector, showing the window of tentatively mapped vertices and the finally mapped vertices behind it]
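The windowed search described above can be sketched in Python as follows. This is one reading of the slide and not the original implementation: the cost function, the alternatives, and the rule that the window advances by one vertex per step are assumptions, and all names are hypothetical.

```python
from itertools import product


def rres(order, alternatives, cost, initial, window):
    """Restricted range exhaustive search over a sliding window.

    order        -- ordered vector of processes
    alternatives -- dict: process -> list of implementation alternatives
    cost         -- function: complete mapping (dict) -> number, lower is better
    initial      -- initial complete mapping (dict)
    window       -- number of processes searched exhaustively at once
    """
    mapping = dict(initial)
    last_start = max(1, len(order) - window + 1)
    for start in range(last_start):
        win = order[start:start + window]
        best, best_cost = None, float("inf")
        # Try every combination of alternatives inside the window.
        for combo in product(*(alternatives[v] for v in win)):
            trial = {**mapping, **dict(zip(win, combo))}
            c = cost(trial)
            if c < best_cost:
                best, best_cost = combo, c
        # Window vertices are tentatively mapped; win[0] leaves the window
        # in the next step and is therefore finally mapped.
        mapping.update(zip(win, best))
    return mapping


order = ["a", "b", "c", "d"]
alternatives = {v: ["SW", "HW"] for v in order}
initial = {v: "SW" for v in order}


def toy_cost(m):
    # Made-up cost: software is slow, but at most two hardware blocks fit.
    hw = sum(1 for r in m.values() if r == "HW")
    return (len(m) - hw) + (10 if hw > 2 else 0)


print(rres(order, alternatives, toy_cost, initial, window=2))
```

With a window as long as the vector the procedure degenerates into a plain exhaustive search, so the window length trades run time against solution quality.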
-
Results
[Figure: normalised relative cost = f(parallelism, locality) for GA, RRES, and TS]
[Figure: averaged cost over window length W for RRES, with ES as reference]
[Figure: averaged validity for ES and RRES]
-
*The Genome Coding
Arrange vertices on a string
String elements (alleles) indicate implementation alternative
What about the order of the vertices? Does it matter?
[Figure: genome of |V| vertices v1, v2, ..., v|V|; each position vk holds the id of an implementation alternative]
List of implementation alternatives for v2
Id : Ai,j(vk) = (cs, et, gc)
1 : A0,0(v2) = (50, 256, 0)
2 : A0,1(v2) = (40, 340, 0)
3 : A1,0(v2) = (88, 192, 0)
4 : A1,1(v2) = (72, 224, 0)
5 : A3,0(v2) = (0, 92, 880)
...
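A small Python sketch may help to see how such a genome is decoded. It uses the five alternatives of v2 listed on the slide; the two-vertex genome, the reuse of the same list for a second vertex, and the function name `decode` are assumptions made only to keep the example short.

```python
# Implementation alternatives of v2 from the slide, as (cs, et, gc) tuples.
ALTERNATIVES_V2 = [
    (50, 256, 0),   # id 1
    (40, 340, 0),   # id 2
    (88, 192, 0),   # id 3
    (72, 224, 0),   # id 4
    (0, 92, 880),   # id 5
]


def decode(genome, alternatives):
    """Translate each allele (1-based alternative id) into its tuple.

    genome       -- list of alleles, one per vertex position
    alternatives -- list of alternative lists, one per vertex position
    """
    return [alternatives[k][allele - 1] for k, allele in enumerate(genome)]


# Toy genome of two vertices that, for brevity, share the list of v2.
genome = [1, 5]
decoded = decode(genome, [ALTERNATIVES_V2, ALTERNATIVES_V2])
print(decoded)
```

The genome itself stores only small integers, so crossover and mutation never have to touch the cost data.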
-
*Recombination with chromosomes
1-point crossover
Multi-point crossover
Uniform crossover
Why does it work?
Fundamental schema theorem and the building block hypothesis
Schema theorem
Short, low-order, above-average schemata (building blocks) proliferate
Below-average schemata die off
What makes schemata fit in system partitioning?
[Figure: chromosome with genes a to m and an example schema; defining length = 7, order o = 5, wildcard * = unspecified]
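The three crossover variants named above can be sketched generically in Python. These are the textbook operators and not necessarily the exact ones used in this work; the function names and the 0.5 swap probability of the uniform variant are assumptions.

```python
import random


def one_point(p1, p2, rng):
    """Exchange the tails of both parents behind one random cut."""
    cut = rng.randrange(1, len(p1))
    return list(p1[:cut]) + list(p2[cut:]), list(p2[:cut]) + list(p1[cut:])


def multi_point(p1, p2, k, rng):
    """Exchange every second segment between k random cut points."""
    cuts = sorted(rng.sample(range(1, len(p1)), k))
    c1, c2 = list(p1), list(p2)
    swap, prev = False, 0
    for cut in cuts + [len(p1)]:
        if swap:
            c1[prev:cut], c2[prev:cut] = c2[prev:cut], c1[prev:cut]
        swap, prev = not swap, cut
    return c1, c2


def uniform(p1, p2, rng):
    """Exchange each gene independently with probability 0.5."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(c1)):
        if rng.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2


rng = random.Random(0)
parent_a = [1] * 8
parent_b = [2] * 8
print(one_point(parent_a, parent_b, rng))
print(multi_point(parent_a, parent_b, 3, rng))
print(uniform(parent_a, parent_b, rng))
```

Fewer cut points keep neighbouring genes together more often, which is why the ordering of vertices on the chromosome matters for building blocks.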
-
*Combinatorial vs. structural fitness
Combinatorial (area, code size, time)
Low resource consumption is ensured for any single vertex
Combinations of assignments utilise resources optimally
Structural (time)
Exact graph matching between task and architecture subgraphs
Parallel execution of processes and data transfers
Structural fitness requires a representation in the chromosome
Building blocks are short, low-order, and fit schemata
[Figure: subgraph with vertices e, f, g, h]
-
*Coding for structural exploitation
Locality preserving chromosome coding
Adjacent vertices in task graph shall be adjacent in chromosome
Use two schedules
As soon as possible (ASAP)
As late as possible (ALAP)
Arrange vertices vi in increasing average start times: stavg(vi) = (stasap(vi) + stalap(vi)) / 2
[Figure: task graph with vertices a to n arranged by rank 0 to 7, its ASAP and ALAP schedules, the start times stasap(b) and stalap(b) of vertex b, and the resulting vertex order]
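The ordering rule can be sketched in Python as follows. The sketch assumes an unconstrained schedule (unlimited resources) for both ASAP and ALAP, with the ALAP schedule bounded by the ASAP schedule length; the function name and the toy graph are hypothetical.

```python
from graphlib import TopologicalSorter


def locality_order(exec_time, edges):
    """Order vertices by the average of their ASAP and ALAP start times.

    exec_time -- dict: vertex -> execution time
    edges     -- list of (tail, head) precedence pairs
    """
    preds = {v: set() for v in exec_time}
    succs = {v: set() for v in exec_time}
    for tail, head in edges:
        preds[head].add(tail)
        succs[tail].add(head)
    topo = list(TopologicalSorter(preds).static_order())

    # ASAP: start as soon as all predecessors have finished.
    asap = {}
    for v in topo:
        asap[v] = max((asap[p] + exec_time[p] for p in preds[v]), default=0)
    length = max(asap[v] + exec_time[v] for v in topo)

    # ALAP: start as late as possible without extending the schedule.
    alap = {}
    for v in reversed(topo):
        latest_finish = min((alap[s] for s in succs[v]), default=length)
        alap[v] = latest_finish - exec_time[v]

    return sorted(topo, key=lambda v: (asap[v] + alap[v]) / 2)


exec_time = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d")]
print(locality_order(exec_time, edges))
```

Vertex c has slack, so its average start time lies between those of b and d, and it ends up between them on the chromosome.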
-
*Results
Impact of genome coding
[Figure: cost for the genome codings new, rank, and random]
-
*More results
Structural mutation
1-gene mutation (M1g)
Swap mutation (Msw)
Multi-swap mutation (Mbb)
-
*More results
Comparison with other heuristics
Penalty reward tabu search (pwTS)
Simulated annealing (SA)
Global criticality/local phase (GCLP)
[Figure: averaged cost and averaged validity]
-
*Conclusion
3-operator GA has been implemented and analysed
Structural problem components (time) have been exposed
Genome coding: locality preserving ordering
Mutation: multi-swap mutation
Crossover: depends heavily on building block size
Comparison with heuristics from literature showed superior performance of GA over pwTS
In contrast to published work
-
*Results
Related to crossover recombination
[Figure: uniform, 10-point, 5-point, and 1-point crossover compared for the new and random genome codings]
-
*More results
Selection over mutation probability
Binary tournament (BT)
Survival of the fittest (SOTF)
Roulette wheel (RW)
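For readers unfamiliar with the three selection schemes, here is a generic Python sketch for a cost-minimising GA. These are common textbook variants and may differ from the ones evaluated here; in particular the weight 1 / (1 + cost) in the roulette wheel is an assumption, as are all function names.

```python
import random


def binary_tournament(population, cost, rng):
    """Draw two distinct individuals and keep the cheaper one."""
    a, b = rng.sample(population, 2)
    return a if cost(a) <= cost(b) else b


def survival_of_the_fittest(population, cost, n):
    """Keep the n cheapest individuals (truncation selection)."""
    return sorted(population, key=cost)[:n]


def roulette_wheel(population, cost, rng):
    """Pick one individual with probability proportional to 1 / (1 + cost)."""
    weights = [1.0 / (1.0 + cost(ind)) for ind in population]
    return rng.choices(population, weights=weights, k=1)[0]


def first_gene(individual):
    # Toy cost: the first gene of the individual.
    return individual[0]


rng = random.Random(0)
population = [(3, 1), (1, 2), (2, 2)]

print(binary_tournament(population, first_gene, rng))
print(survival_of_the_fittest(population, first_gene, 2))
print(roulette_wheel(population, first_gene, rng))
```

Tournament and roulette wheel are stochastic and keep some diversity, while truncation is deterministic and applies the strongest selection pressure.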