A Novel Algorithm Combining Temporal Partitioning and Sharing of Functional Units
Transcript of “A Novel Algorithm Combining Temporal Partitioning and Sharing of Functional Units”
João M. P. Cardoso
Faculty of Sciences and Technology, University of Algarve, Faro, Portugal
April 30, 2001
IEEE Symposium on Field-Programmable Custom Computing Machines, Rohnert Park, CA, USA
Index
Introduction
Temporal Partitioning
Problem Definition
New vs Previous Approach
Algorithm Working Through an Example
Experimental Results
Related Work
Conclusions
Future Work
Introduction
“Virtual Hardware”: reuse of devices; save silicon area; view of “unlimited resources”; enabled by dynamically reconfigurable FPGAs
Two concepts: context switching among functionalities; allowing a large “function” to be executed
FPGA devices allowing virtualization: off-chip configurations; on-chip configurations
Several research efforts…
Introduction
Answers: Temporal Partitioning; Sharing of Functional Units
Goal: combining the two...
[Figure: dataflow graph of the differential-equation example, with inputs x, y, u, dx, outputs x_1, u_1, y_1, and operations +, -, x, and << 1]
Size larger than the available reconfigware area?
Temporal Partitioning
[Figure: first temporal partition of the example graph, producing the intermediate value aux1 and the outputs x_1 and y_1]
Temporal Partitioning
[Figure: second temporal partition, consuming aux1 and producing u_1]
Temporal Partitioning
[Figure: the two temporal partitions executed in sequence over time, time-sharing the device]
Temporal Partitioning
Create temporal partitions to be executed by time-sharing the device
Netlist level (structural): difficulties when dealing with feedback; loss of information; flat structure; intricate for exploiting sharing of functional units
Behavioral level (functional): loops can be explicitly represented; better design decisions; “a must” for compilers for reconfigurable computing
Problem Definition
But what if we decrease the needed area by sharing functional units?
Simultaneous Temporal Partitioning and Sharing of Functional Units
THE PROBLEM:
Given a dataflow graph (representing a behavioral description), a library of components,...
Map the dataflow graph onto the available resources of the FPGA device: considering sharing of functional units; considering temporal partitioning; decreasing the overall execution latency
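As an illustration of these constraints (not the paper's algorithm), a minimal greedy partitioner can assign each node, in topological order, to the earliest partition that respects precedence and the area budget. The operation types and edge set below are assumptions chosen to match the worked example later in the talk (+: 1 cell, x: 2 cells, 3 cells available); the sketch ignores control steps and the sharing of functional units, which the actual algorithm handles.

```python
# Minimal greedy sketch, NOT the paper's algorithm: it honours only the
# precedence and per-partition area constraints, ignoring control steps
# and sharing of functional units.

AREA = {'+': 1, 'x': 2}  # cells per operation type (from the example)

def topo_sort(preds):
    """Return the nodes in a topological order (predecessors first)."""
    seen, order = set(), []
    def visit(n):
        if n not in seen:
            seen.add(n)
            for p in preds[n]:
                visit(p)
            order.append(n)
    for n in preds:
        visit(n)
    return order

def greedy_partition(op, preds, avail_area):
    """Map each node to the earliest temporal partition where it fits."""
    tp_of, used = {}, []  # node -> TP index; area already used per TP
    for n in topo_sort(preds):
        assert AREA[op[n]] <= avail_area, "node larger than the device"
        t = max((tp_of[p] for p in preds[n]), default=0)  # precedence
        while True:
            if t == len(used):
                used.append(0)  # open a new partition
            if used[t] + AREA[op[n]] <= avail_area:
                tp_of[n] = t
                used[t] += AREA[op[n]]
                break
            t += 1
    return tp_of

# Hypothetical DFG consistent with the worked example (nodes 0-5,
# multipliers at nodes 3 and 5, total area 8 cells, 3 cells available):
op = {0: '+', 1: '+', 2: '+', 3: 'x', 4: '+', 5: 'x'}
preds = {0: [], 1: [], 2: [0, 1], 3: [], 4: [3], 5: [2, 4]}
print(greedy_partition(op, preds, 3))  # → {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
```

With these assumed costs the greedy sketch also ends up with three partitions; the paper's algorithm additionally sizes each partition in control steps (MAXCS) so that functional units can be shared inside a partition.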
New vs Previous Approach
Previous: Constraints, DFG/CDFG → Temporal Partitioning → High-Level Synthesis (with Component Library) → Circuit generation, Logic Synthesis
New: Constraints, DFG/CDFG → Simultaneous Temporal Partitioning and High-Level Synthesis (with Component Library) → Circuit generation, Logic Synthesis
Algorithm Working Through an Example
Suppose the following dataflow graph. Consider:
Area(+) = 1 cell; Area(x) = 2 cells; Delay(+) = 1 control step (cs); Delay(x) = 2 cs
Total area of the DFG: 8 cells
Available area: 3 cells
[Figure: example dataflow graph with nodes 0–5]
Algorithm Working Through an Example
Calculate ASAP and ALAP values:
Node: 0 1 2 3 4 5
ASAP: 0 0 1 0 2 3
ALAP: 1 1 2 0 2 3
Algorithm Working Through an Example
Identify the critical path: the zero-slack nodes (ASAP = ALAP), i.e. 3 → 4 → 5
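The ASAP/ALAP table and the critical path can be reproduced in a few lines. The delays follow the slide (+: 1 cs, x: 2 cs); the edge set, and the choice of nodes 3 and 5 as the multipliers (giving the stated total of 4×1 + 2×2 = 8 cells), are assumptions picked to be consistent with the values shown.

```python
# Hedged reconstruction of the example: the edge set and the node types
# are assumptions consistent with the slide's ASAP/ALAP table.
DELAY = {0: 1, 1: 1, 2: 1, 3: 2, 4: 1, 5: 2}  # +: 1 cs, x: 2 cs
EDGES = [(0, 2), (1, 2), (2, 5), (3, 4), (4, 5)]

def asap_alap(delay, edges):
    """ASAP/ALAP start times; ALAP is relative to the ASAP schedule length."""
    preds = {n: [p for p, s in edges if s == n] for n in delay}
    succs = {n: [s for p, s in edges if p == n] for n in delay}
    order = sorted(delay)  # node ids happen to be in topological order
    asap = {}
    for n in order:
        asap[n] = max((asap[p] + delay[p] for p in preds[n]), default=0)
    length = max(asap[n] + delay[n] for n in delay)  # critical-path length
    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] for s in succs[n]), default=length) - delay[n]
    return asap, alap

asap, alap = asap_alap(DELAY, EDGES)
critical = [n for n in sorted(DELAY) if asap[n] == alap[n]]
print({n: asap[n] for n in sorted(asap)})  # {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 3}
print({n: alap[n] for n in sorted(alap)})  # {0: 1, 1: 1, 2: 2, 3: 0, 4: 2, 5: 3}
print(critical)                            # [3, 4, 5] — the zero-slack nodes
```

Slack is ALAP − ASAP; the zero-slack nodes 3, 4, 5 form the critical path identified on the slide.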
Algorithm Working Through an Example
Create an initial number of TPs: suppose 3
[Figure: three empty temporal partitions plotted against Area and MAXCS]
Algorithm Working Through an Example
Map each node of the critical path on each temporal partition
[Figure: node 3 mapped into TP1, node 4 into TP2, node 5 into TP3, each partition sized by its node's delay]
Algorithm Working Through an Example
Try to map nodes in each temporal partition (1), (2), (3)
[Figure sequence: nodes 0 and 1 join node 3 in TP1; node 2 cannot yet be mapped into any of the three partitions]
Algorithm Working Through an Example
Relax: add 1 control step to MAXCS
Algorithm Working Through an Example
Try to map nodes in each temporal partition (1), (2)
[Figure sequence: with the relaxed MAXCS, node 2 is mapped into TP2]
Algorithm Working Through an Example
Merge Operation (1)
[Figure: TP1 and TP2 are merged into a single partition “1,2” of 4 cs; TP3 remains separate]
Algorithm Working Through an Example
Merge Operation (2)
[Figure: TP3 is merged as well, leaving a single partition “1,2,3” in which the whole graph executes by sharing the functional units]
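The merge steps rely on the duality the talk emphasises: inside one partition, operations of the same type can serialise onto a single functional unit, so merging costs control steps instead of area. A tiny feasibility check under that full-sharing assumption (costs taken from the example; the node-type assignment is an assumption):

```python
# Hedged sketch: under full sharing, a merged partition needs only one
# functional unit per distinct operation type; merging is area-feasible
# when those single instances fit the device, at the price of a longer
# schedule (more control steps) but one fewer reconfiguration.

AREA = {'+': 1, 'x': 2}  # cells per operation type (from the example)

def merge_feasible(op_types, avail_area):
    """True if one FU per distinct type of the merged partition fits."""
    return sum(AREA[t] for t in set(op_types)) <= avail_area

# Merging all three partitions of the example (four +, two x) needs
# just one adder and one multiplier: 1 + 2 = 3 cells.
print(merge_feasible(['+', '+', '+', 'x', '+', 'x'], 3))  # True
```

This is why the example can end in a single partition: sharing shrinks the area demand enough to absorb all the temporal partitions, trading reconfigurations for extra control steps.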
Experimental Results
Near-optimal without sharing vs with sharing
[Chart: number of temporal partitions (#TPs) and performance improvement for EX1, SEHWA, HAL, and EWF; series #p(SA), #p(Our*), %(#cs-Our*), %(#cs-Our**)]
Experimental Results
Near-optimal without sharing vs with sharing
[Chart: #TPs and performance improvement for FIR and MAT4x4; series #p(SA), #p(Our*), %(#cs-Our*), %(#cs-Our**)]
Experimental Results
Performance vs No. of Temporal Partitions
[Chart: Mult4x4, RMAX=10 (no sharing of adders); final #TPs and execution time (#cs) versus the initial number of TPs]
Experimental Results
Is the algorithm good for scheduling?
[Chart: #cs achieved by our algorithm compared to known optimum scheduling results on EWF and SEHWA]
Related Work
List-Scheduling considering dynamic reconfiguration [Vasilko et al., FPL’96]
ASAP [GajjalaPurna et al., IEEE Trans. on Comp., 1999]
Minimize latency taking into account communication costs [Cardoso et al., VLSI’99]: Enhanced Static-List Scheduling; iterative approach (Simulated Annealing)
ILP formulation [SPARCs, DATE’98; RAW’98]
Enhanced Force-Directed List Scheduling [Pandey et al., SPIE’99]
And others [see the Related Work section]
Conclusions
Novel algorithm that simultaneously performs temporal partitioning and sharing of functional units: low complexity; heuristic approach; based on gradually enlarging the time slots
Permits exploiting the duality between the number of temporal partitions and resource sharing
Close-to-optimum results on some examples
Results show that the algorithm also performs well when used purely for scheduling
Future Work
Enhancements to the algorithm: consider functional units with pipelining; consider pipelining between execution and reconfiguration
Study the possibility of taking into account communication and reconfiguration costs
Test results with a reconfigurable computing system (commercial board)
Contact Author
João M. P. Cardoso
http://w3.ualg.pt/~jmcardo
THANK YOU!