Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures
University of Michigan, Electrical Engineering and Computer Science
Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
Workshop on Application-Specific Processors (WASP-2)
December 2, 2003
Clustered Architectures
• Decentralize architecture to reduce register file bottleneck
• Used in Lx/ST200, TI C6x, Analog Devices TigerSHARC, and others
• Goal of our work: automatic synthesis of an application-specific heterogeneous multicluster architecture
[Figure: Homogeneous Clustered Architecture: Cluster 1 (32-bit) and Cluster 2 (32-bit), each with its own register file and identical function units supporting +, *, -, <<]
[Figure: Heterogeneous Clustered Architecture: Cluster 1 (32-bit) and Cluster 2 (8-bit), each with its own register file and a different mix of function units (+, -, <<, *)]
Our Approach
• Partition operations with both performance and required hardware cost in mind
– Maintain performance and reduce cost (bitwidth, FU repertoire)
– Previous work has focused on single basic block, single cluster [Note ‘91] [Paulin ‘89] [Marwedel ‘90]
• Each partition dictates a cluster configuration which has an associated hardware cost
[Figure: two clusters, each with a register file (RF) and function units (FU)]
Our Proposed System
• Today’s Focus: Cost-Sensitive Operation Partitioning
• Input: Application, High-level machine specification:
– Number of clusters, number of generic FU’s
• Output: Multicluster Architecture Description
Cost-Sensitive Operation Partitioning
• Builds on Region-Based Hierarchical Operation Partitioning (RHOP)
– Purely performance-based partitioner, no notion of hardware cost
– Weight calculation creates guides for good partitions
– Graph partitioning assigns operations to clusters based on the given weights
• Cost metric added to the Graph Partitioning phase to account for gate cost
[Figure: a Region passes through Weight Calculation and then Graph Partitioning; the example dependence graph is annotated with weights]
Coarsening Phase
• Progressively groups highly related operations together
– Continually pairs operations together
– Forces partitioner to consider several operations as a single unit
– Traditional RHOP: coarsen using edge weights
– Cost-centric coarsening can ignore dependence edge criticality
[Figure: Coarsened States 1 through 4; legend: narrow bitwidth, wide bitwidth]
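The pairing idea behind coarsening can be sketched in a few lines. This is a minimal illustration, not the RHOP implementation: the graph representation, the `coarsen_once` helper, and the rule of matching the heaviest connections first are all assumptions made for the example. A cost-centric variant could instead prefer pairing operations of similar bitwidth.

```python
def coarsen_once(groups, edge_weights):
    """Pair up groups along the heaviest edges (one coarsening step).

    groups: list of frozensets of operation names.
    edge_weights: dict mapping frozenset({op_a, op_b}) -> weight.
    Returns a new, coarser list of groups.
    """
    # Map each operation to the index of the group that contains it.
    owner = {op: i for i, g in enumerate(groups) for op in g}

    # Sum edge weights between distinct groups.
    between = {}
    for pair, w in edge_weights.items():
        a, b = tuple(pair)
        ga, gb = owner[a], owner[b]
        if ga != gb:
            key = (min(ga, gb), max(ga, gb))
            between[key] = between.get(key, 0) + w

    # Greedily match groups, heaviest connection first.
    matched = set()
    merged = []
    for (ga, gb), _ in sorted(between.items(), key=lambda kv: (-kv[1], kv[0])):
        if ga not in matched and gb not in matched:
            matched.update((ga, gb))
            merged.append(groups[ga] | groups[gb])

    # Groups that found no partner carry over unchanged.
    merged.extend(g for i, g in enumerate(groups) if i not in matched)
    return merged


# Hypothetical four-operation chain: a-b and c-d are strongly related.
ops = [frozenset({x}) for x in "abcd"]
weights = {
    frozenset({"a", "b"}): 10,
    frozenset({"b", "c"}): 1,
    frozenset({"c", "d"}): 8,
}
state1 = coarsen_once(ops, weights)      # pairs a with b, c with d
state2 = coarsen_once(state1, weights)   # merges the two pairs
```

Keeping every intermediate state (`ops`, `state1`, `state2`) is what lets a later refinement pass walk back through the coarsening steps one at a time.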
Partitioning Phase
• Travel back through each of the coarsening steps; at each stage, try refining the partition
– est_cycles: performance metric from traditional RHOP
– Adds new cost metric for cost of the cluster
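Benefit of a move, combining performance and cost:
$$\text{benefit} = \frac{1}{\text{cost}_{new} \times \text{est\_cycles}_{new}} - \frac{1}{\text{cost}_{old} \times \text{est\_cycles}_{old}}$$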
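A small sketch of how such a benefit value might be computed. It assumes the benefit of a move is the change in 1 / (cost × est_cycles) between the old and new partition; that form is an assumption based on the terms named on this slide, and the `benefit` function name is hypothetical. The sample values are the (est_cycles, cost) pairs annotated in the refinement example of this deck; reading them as successive states is also an assumption.

```python
def benefit(cost_old, cycles_old, cost_new, cycles_new):
    """Change in 1 / (cost * est_cycles); positive means the move helps.

    Assumed form of the metric, not confirmed by the slides.
    """
    return 1.0 / (cost_new * cycles_new) - 1.0 / (cost_old * cycles_old)


# (est_cycles, cost in gates) pairs from the refinement example.
first = benefit(cost_old=28_000, cycles_old=7, cost_new=15_000, cycles_new=8)
second = benefit(cost_old=15_000, cycles_old=8, cost_new=15_000, cycles_new=7)
print(first > 0, second > 0)
```

Under this form, giving up one cycle to cut the cost from 28K to 15K is still a positive-benefit move, because the product cost × est_cycles shrinks.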
Cost-Sensitive Refinement
• Moves are made when they have positive benefit
• When no more moves can be made, the algorithm uncoarsens to the previous coarsened state and tries moving again
[Figure: refinement example with three annotated states: est_cycles = 7, cost 28K; est_cycles = 8, cost 15K; est_cycles = 7, cost 15K. Legend: narrow bitwidth, wide bitwidth]
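The move-then-uncoarsen loop can be sketched as below. Everything here is a simplified stand-in: `refine`, the toy estimators, and the four-operation example are hypothetical, and a move is accepted when it raises 1 / (cost × est_cycles), which is one assumed reading of "positive benefit".

```python
def refine(levels, assign, cost, est_cycles, n_clusters=2):
    """Greedy refinement across coarsening levels, coarsest first.

    levels: list of groupings (lists of frozensets of ops), coarsest first.
    assign: dict op -> cluster index (initial partition), updated in place.
    cost, est_cycles: callables that score a whole assignment.
    """
    def score(a):
        return 1.0 / (cost(a) * est_cycles(a))

    for groups in levels:              # uncoarsen one level at a time
        improved = True
        while improved:                # keep moving until nothing helps
            improved = False
            for g in groups:
                for target in range(n_clusters):
                    trial = dict(assign)
                    trial.update({op: target for op in g})
                    if score(trial) > score(assign):   # positive benefit
                        assign.update(trial)
                        improved = True
    return assign


# Toy estimators (assumptions): a cluster costs as much as its widest
# operation, and est_cycles is the operation count of the busiest cluster.
WIDTH = {"a": 8, "b": 8, "c": 32, "d": 32}

def toy_cost(assign):
    return sum(
        max(WIDTH[op] for op, c in assign.items() if c == k)
        for k in set(assign.values())
    )

def toy_cycles(assign):
    clusters = list(assign.values())
    return max(clusters.count(k) for k in set(clusters))


levels = [
    [frozenset("ab"), frozenset("cd")],    # coarsest state
    [frozenset(op) for op in "abcd"],      # fully uncoarsened
]
start = {"a": 0, "c": 0, "b": 1, "d": 1}   # narrow and wide ops mixed
final = refine(levels, dict(start), toy_cost, toy_cycles)
```

In this toy run the narrow operations end up together on one cluster and the wide ones on the other, so only one cluster needs 32-bit hardware.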
Multicluster Cost Model
• Cost model determines an estimate of gate cost of clusters
– Estimate minimum required cost to support partitioned operations
• Factors that influence hardware cost:
– Register file size/width
– Functional Unit (FU) width
– FU opcode repertoire
• Greedy algorithm used
– Ignores dependences between operations
– Similar to Rec/Res MII calculations for software pipelined loops
[Figure: a cluster with a 32-bit register file and two integer units (Int Unit 1, Int Unit 2); operations *10, *16, +8, +16, +32, +16 are ordered from high cost to low cost and assigned to the units]
Total cost of cluster: 1 32-bit register file, 1 16-bit multiplier/adder, 1 32-bit adder
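A greedy estimate of this kind might look like the sketch below. The gate-cost functions, the per-unit capacity rule, and the names `fu_cost`, `estimate_cluster`, and `describe` are assumptions for illustration; the real cost numbers came from synthesis, not from these formulas. Operations are written as (opcode, bitwidth) pairs, using the six operations from the slide.

```python
import math

# Toy gate-cost functions (assumptions, not synthesized numbers):
# an adder grows linearly with width, a multiplier quadratically.
GATE_COST = {"+": lambda w: w, "*": lambda w: w * w}

def fu_cost(ops):
    """Cost of one FU that must support every (opcode, width) in ops."""
    if not ops:
        return 0
    width = max(w for _, w in ops)
    return sum(GATE_COST[op](width) for op in {o for o, _ in ops})

def estimate_cluster(ops, n_fus):
    """Greedily pack ops onto n_fus units, most expensive op first."""
    slots = math.ceil(len(ops) / n_fus)   # capacity per FU, dependences ignored
    fus = [[] for _ in range(n_fus)]
    by_cost = sorted(ops, key=lambda o: GATE_COST[o[0]](o[1]), reverse=True)
    for op in by_cost:
        # Place the op on the unit whose cost it increases the least.
        candidates = [f for f in fus if len(f) < slots]
        best = min(candidates, key=lambda f: fu_cost(f + [op]) - fu_cost(f))
        best.append(op)
    return fus

def describe(fu):
    """Return (width, sorted opcode repertoire) for one FU's operations."""
    return max(w for _, w in fu), sorted({o for o, _ in fu})


ops = [("*", 10), ("*", 16), ("+", 8), ("+", 16), ("+", 32), ("+", 16)]
fus = estimate_cluster(ops, n_fus=2)
rf_width = max(w for _, w in ops)         # register file must hold the widest value
for fu in fus:
    print(describe(fu))
```

With these toy costs the packing yields one 16-bit unit that multiplies and adds and one 32-bit adder, next to a 32-bit register file, which matches the cluster total given on the slide.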
Experimental Methodology
• Trimaran toolset: a retargetable VLIW compiler
• Evaluated main loop of DSP kernels and selected benchmarks from MediaBench, MiBench and NetBench
• Bitwidth information gathered through automatic program analysis
• Cost estimates computed using Synopsys design tools at 0.18 µm
• 64 registers per cluster

Name    Configuration
2-2111  2 clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-2111  4 clusters; 2 I, 1 F, 1 M, 1 B per cluster
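The configuration names follow a simple pattern that can be decoded mechanically. The sketch below is illustrative only: `MachineConfig` and `parse_config` are hypothetical names, and reading I, F, M, B as integer, floating-point, memory, and branch units is an assumption, since the slides do not expand the letters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MachineConfig:
    clusters: int
    int_units: int      # "I" (assumed: integer units per cluster)
    float_units: int    # "F" (assumed: floating-point units per cluster)
    mem_units: int      # "M" (assumed: memory units per cluster)
    branch_units: int   # "B" (assumed: branch units per cluster)

    @property
    def total_units(self) -> int:
        per_cluster = (self.int_units + self.float_units
                       + self.mem_units + self.branch_units)
        return self.clusters * per_cluster

def parse_config(name: str) -> MachineConfig:
    """Decode names such as '2-2111' into a MachineConfig."""
    clusters, units = name.split("-")
    if len(units) != 4 or not (clusters + units).isdigit():
        raise ValueError(f"unrecognized configuration name: {name!r}")
    i, f, m, b = (int(ch) for ch in units)
    return MachineConfig(int(clusters), i, f, m, b)


two = parse_config("2-2111")
four = parse_config("4-2111")
print(two, two.total_units)
print(four, four.total_units)
```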
2-Cluster Cost Savings and Performance
[Chart: Performance Loss and Cost Savings per benchmark (channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, Average); y-axis: Percentage Performance Loss / Cost Savings, -20.0 to 40.0]
Source of Cost Savings Breakdown
[Chart: Normalized Cost (0.3 to 1) per benchmark for two series, CS-RHOP 32-bit and CS-RHOP; benchmarks: channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, Average]
Pareto Charts of Examined Machines
[Charts: Relative Performance versus Cost (thousands of gates) for the fsed kernel and the LU kernel]
• A wide spectrum of machine configurations was examined
• Multiple groups often appear with expensive units
Work in Progress
• Merging step
– How can machine designs for several basic blocks be combined?
• Inaccurate cost model
– How can a more accurate estimate for the cost be developed?
• Space Exploration (external/internal)
– Number of clusters and generic FU’s are externally spacewalked
– Allowable performance increase internally spacewalked
– What areas of this space exploration should be external/internal?
• Reprogrammability of designed machines
Conclusions
• Developed a cost-sensitive method for partitioning operations across clusters
• Used this partitioning to define an application-specific low-cost multicluster datapath architecture
• Average performance loss and cost savings for two and four cluster machines:

Machine Configuration  Performance Loss  Cost Savings
2-cluster              -5.4%             20.4%
4-cluster              -2.5%             28.0%
Questions?
http://cccp.eecs.umich.edu
Backup Slides
4-Cluster Cost Savings and Performance
[Chart: Performance and Cost Savings per benchmark (channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, Average); y-axis: Percentage Performance Loss / Cost Savings, -20.0 to 70.0]
Previous Work
• Datapath synthesis
– Cathedral-III: complete synthesis system from IMEC
– Paulin and Knight: force directed scheduling
– Sehwa: designed processing pipelines from behavioral specs
– PICO: designed application-specific VLIW processors
• Bitwidth sensitive datapath synthesis
– Valen-C: augmented C language to convey bitwidth information
Weight Calculation Phase
• Edge weights
– Assigns higher weight to edges likely to increase schedule length when cut
– Uses a slack distribution method to assign weights
• Node weights
– Assigns weights to each operation based on how much it is likely to affect the load of the FUs in the cluster
– Higher weights attributed to operations that can
• Not changed from Traditional RHOP
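One way to see why slack matters for edge weights is the simplified sketch below. It is not the slack distribution method itself: it only computes per-edge slack from ASAP/ALAP times with unit latencies and gives zero-slack edges a higher weight. The function names and the weight values `critical=10` and `base=1` are assumptions for illustration.

```python
def edge_slack(nodes, edges, latency=1):
    """Slack of each dependence edge in a DAG.

    nodes: operations in topological order.
    edges: list of (src, dst) dependence pairs.
    """
    # Earliest start times, propagated forward.
    asap = {n: 0 for n in nodes}
    for n in nodes:
        for s, d in edges:
            if s == n:
                asap[d] = max(asap[d], asap[n] + latency)

    # Latest start times, propagated backward from the schedule length.
    length = max(asap.values())
    alap = {n: length for n in nodes}
    for n in reversed(nodes):
        for s, d in edges:
            if d == n:
                alap[s] = min(alap[s], alap[n] - latency)

    return {(s, d): alap[d] - asap[s] - latency for s, d in edges}

def edge_weights(nodes, edges, critical=10, base=1):
    """Heavier weight for edges with no slack (cutting them likely hurts)."""
    slack = edge_slack(nodes, edges)
    return {e: critical if s == 0 else base for e, s in slack.items()}


# Hypothetical graph: a -> b -> c is the critical chain, a -> d has slack.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "d")]
weights = edge_weights(nodes, edges)
```

A cut across a zero-slack edge adds intercluster communication delay directly to the critical path, which is why such edges get the larger weight here.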