Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures

Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke
Advanced Computer Architecture Lab
Electrical Engineering and Computer Science
University of Michigan

Workshop on Application-Specific Processors (WASP-2)
December 2, 2003


Clustered Architectures

• Decentralize the architecture to reduce the register file bottleneck
• Used in Lx/ST200, TI C6x, Analog Devices TigerSHARC, and others
• Goal of our work: automatic synthesis of an application-specific heterogeneous multicluster architecture

[Figure: Homogeneous Clustered Architecture. Cluster 1 (32-bit) and Cluster 2 (32-bit), each with its own register file and identical function units supporting +, *, -, <<.]

[Figure: Heterogeneous Clustered Architecture. Cluster 1 (32-bit) and Cluster 2 (8-bit), each with its own register file and function units with reduced opcode repertoires (+ -, <<, <<*).]


Our Approach

• Partition operations with both performance and required hardware cost in mind
– Maintain performance and reduce cost (bitwidth, FU repertoire)
– Previous work has focused on a single basic block and a single cluster [Note ‘91] [Paulin ‘89] [Marwedel ‘90]
• Each partition dictates a cluster configuration, which has an associated hardware cost

[Figure: two clusters, each consisting of a register file (RF) and function units (FU)]


Our Proposed System

• Today’s focus: cost-sensitive operation partitioning
• Input: application and a high-level machine specification
– Number of clusters, number of generic FUs
• Output: multicluster architecture description


Cost-Sensitive Operation Partitioning

• Builds on Region-Based Hierarchical Operation Partitioning (RHOP)
– Purely performance-based partitioner with no notion of hardware cost
– Weight calculation creates guides for good partitions
– Graph partitioning assigns operations to clusters based on the given weights
• A cost metric that accounts for gate cost is added to the graph partitioning phase

[Figure: RHOP flow for a region: Weight Calculation followed by Graph Partitioning, illustrated on an example dataflow graph annotated with node and edge weights]


Coarsening Phase

• Progressively groups highly related operations together
– Continually pairs operations together
– Forces the partitioner to consider several operations as a single unit
– Traditional RHOP: coarsen using edge weights
– Cost-centric coarsening can ignore dependence edge criticality

[Figure: Coarsened State 1 through Coarsened State 4 of an example graph; operations are marked as narrow bitwidth or wide bitwidth]
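The edge-weight-driven pairing described for traditional RHOP can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names `coarsen_once` and `coarsen`, the greedy heaviest-edge-first matching, and the data layout are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Group = FrozenSet[str]  # a set of operation names treated as one unit


def coarsen_once(groups: List[Group],
                 edge_weights: Dict[Tuple[str, str], int]) -> List[Group]:
    """Merge groups pairwise, taking the heaviest connecting edges first."""
    owner = {op: g for g in groups for op in g}
    # Total edge weight between every pair of distinct groups.
    between: Dict[FrozenSet[Group], int] = {}
    for (src, dst), weight in edge_weights.items():
        pair = frozenset((owner[src], owner[dst]))
        if len(pair) == 2:
            between[pair] = between.get(pair, 0) + weight
    matched = set()
    merged: List[Group] = []
    for pair, _ in sorted(between.items(), key=lambda kv: -kv[1]):
        if matched.isdisjoint(pair):  # each group is paired at most once
            matched.update(pair)
            merged.append(frozenset().union(*pair))
    # Groups that found no partner carry over unchanged.
    merged.extend(g for g in groups if g not in matched)
    return merged


def coarsen(ops: List[str],
            edge_weights: Dict[Tuple[str, str], int]) -> List[List[Group]]:
    """Return every coarsened state, from single operations to the coarsest."""
    states = [[frozenset([op]) for op in ops]]
    while True:
        nxt = coarsen_once(states[-1], edge_weights)
        if len(nxt) == len(states[-1]):  # nothing left to merge
            return states
        states.append(nxt)
```

A cost-centric variant would keep the same structure but rank candidate pairs by a cost-related key instead of edge weight; the slides do not specify that key, so it is left out here.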


Partitioning Phase

• Travel back through each of the coarsening steps; at each stage, try refining the partition
– est_cycles: performance metric from traditional RHOP
– Adds a new metric for the hardware cost of the cluster

\[
\text{cost-performance benefit} = \frac{1}{\text{cost}_{new} \times \text{est\_cycles}_{new}} - \frac{1}{\text{cost}_{old} \times \text{est\_cycles}_{old}}
\]


Cost-Sensitive Refinement

• Moves are made when they have positive benefit
• When no more moves can be made, the algorithm uncoarsens to the previous coarsened state and tries moving again

[Figure: refinement example on a graph of narrow- and wide-bitwidth operations, annotated with est_cycles = 7 and cost: 28K]
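The move rule can be sketched as a greedy loop. The sketch assumes that a move's benefit is the change in 1 / (cost × est_cycles), which is one reading of the benefit formula on the Partitioning Phase slide. The names `benefit`, `refine`, and the `evaluate` callback (returning a `(cost, est_cycles)` pair for a candidate assignment) are hypothetical and stand in for the cost model and the RHOP schedule estimator.

```python
from typing import Callable, Dict, Tuple

Assignment = Dict[str, int]  # coarsened unit name -> cluster index


def benefit(cost_old: float, cycles_old: float,
            cost_new: float, cycles_new: float) -> float:
    """Positive when the move lowers the cost x est_cycles product (assumed form)."""
    return 1.0 / (cost_new * cycles_new) - 1.0 / (cost_old * cycles_old)


def refine(assignment: Assignment, num_clusters: int,
           evaluate: Callable[[Assignment], Tuple[float, float]]) -> Assignment:
    """Apply the best positive-benefit move until no such move remains."""
    current = dict(assignment)
    cost, cycles = evaluate(current)
    while True:
        best_gain, best_move = 0.0, None
        for unit in sorted(current):
            for cluster in range(num_clusters):
                if cluster == current[unit]:
                    continue
                trial = dict(current)
                trial[unit] = cluster
                new_cost, new_cycles = evaluate(trial)
                gain = benefit(cost, cycles, new_cost, new_cycles)
                if gain > best_gain:
                    best_gain, best_move = gain, (unit, cluster)
        if best_move is None:
            # No positive-benefit move: the caller would uncoarsen and retry.
            return current
        current[best_move[0]] = best_move[1]
        cost, cycles = evaluate(current)


# Toy evaluator (made-up numbers): a cluster costs its widest operation,
# and est_cycles is the size of the most heavily loaded cluster.
WIDTHS = {"a": 8, "b": 8, "c": 32, "d": 32}


def toy_evaluate(assignment: Assignment) -> Tuple[float, float]:
    clusters: Dict[int, list] = {}
    for op, cluster in assignment.items():
        clusters.setdefault(cluster, []).append(WIDTHS[op])
    cost = sum(max(widths) for widths in clusters.values())
    cycles = max(len(widths) for widths in clusters.values())
    return float(cost), float(cycles)
```

With the toy evaluator, starting from a partition that mixes narrow and wide operations in both clusters, the loop ends with the two 8-bit operations in one cluster and the two 32-bit operations in the other.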


Multicluster Cost Model

• Cost model estimates the gate cost of clusters
– Estimates the minimum cost required to support the partitioned operations
• Factors that influence hardware cost:
– Register file size/width
– Functional unit (FU) width
– FU opcode repertoire
• Greedy algorithm used
– Ignores dependences between operations
– Similar to Rec/Res MII calculations for software pipelined loops

[Figure: example cluster with a 32-bit register file and two integer units (Int Unit 1, Int Unit 2); the operations *10, *16, +8, +16, +32, +16 (opcode and bitwidth) are ordered from high cost to low cost and assigned to the units]

Total cost of cluster: one 32-bit register file, one 16-bit multiplier/adder, one 32-bit adder
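A greedy estimate in the spirit of this slide can be sketched as follows. Everything numeric here is an assumption for illustration: the `GATES` table (a quadratic multiplier and a linear adder), the equal-slots-per-unit rule, and the smallest-marginal-cost placement are hypothetical choices, not the authors' cost model. Like the slide's algorithm, the sketch ignores dependences between operations.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import List, Set, Tuple

# Hypothetical relative gate costs per opcode as a function of bitwidth.
GATES = {"*": lambda w: w * w, "+": lambda w: w}


@dataclass
class Unit:
    width: int = 0
    opcodes: Set[str] = field(default_factory=set)
    used: int = 0  # operations already placed on this unit


def unit_cost(width: int, opcodes: Set[str]) -> int:
    return sum(GATES[op](width) for op in opcodes)


def estimate_cluster(ops: List[Tuple[str, int]], num_units: int):
    """Return (register file width, [(unit width, sorted opcodes), ...])."""
    slots = ceil(len(ops) / num_units)  # assumed per-unit capacity
    units = [Unit() for _ in range(num_units)]
    # Place the most expensive operations first (high cost to low cost).
    for opcode, width in sorted(ops, key=lambda o: -GATES[o[0]](o[1])):
        def marginal(u: Unit) -> int:
            grown = unit_cost(max(u.width, width), u.opcodes | {opcode})
            return grown - unit_cost(u.width, u.opcodes)

        free = [u for u in units if u.used < slots]
        best = min(free, key=marginal)  # first unit with the smallest increase
        best.width = max(best.width, width)
        best.opcodes.add(opcode)
        best.used += 1
    rf_width = max((w for _, w in ops), default=0)
    return rf_width, [(u.width, sorted(u.opcodes)) for u in units]
```

On the six operations from the slide with two units, this sketch yields a 32-bit register file, a 16-bit unit supporting `*` and `+`, and a 32-bit unit supporting only `+`, which agrees with the total listed above.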


Experimental Methodology

• Trimaran toolset: a retargetable VLIW compiler
• Evaluated the main loop of DSP kernels and selected benchmarks from MediaBench, MiBench and NetBench
• Bitwidth information gathered through automatic program analysis
• Cost estimates computed using Synopsys design tools at 0.18 µm
• 64 registers per cluster

Name      Configuration
2-2111    2 clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-2111    4 clusters; 2 I, 1 F, 1 M, 1 B per cluster


2-Cluster Cost Savings and Performance

[Figure: bar chart of Performance Loss and Cost Savings per benchmark (channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, Average). Y-axis: Percentage Performance Loss / Cost Savings, from -20.0 to 40.0.]


Source of Cost Savings Breakdown

[Figure: bar chart of Normalized Cost (0.3 to 1) per benchmark (channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, Average), comparing CS-RHOP 32-bit with CS-RHOP.]


Pareto Charts of Examined Machines

• A wide spectrum of machine configurations was examined
• Multiple groups of design points often appear when expensive units are involved

[Figure: two Pareto charts of Relative Performance versus Cost (thousands of gates), one for the fsed kernel and one for the LU kernel; the cost axes span 6 to 9.5 and 18 to 40 thousand gates.]
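Picking the Pareto-optimal machines out of a set of examined configurations can be sketched as below. The function name `pareto_front` and the `(cost, relative_performance)` tuple layout are assumptions for illustration, not part of the authors' tooling.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost in thousands of gates, relative performance)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both cost and performance."""
    front: List[Point] = []
    # Scan from cheapest to most expensive; at equal cost, best performance first.
    for cost, perf in sorted(set(points), key=lambda p: (p[0], -p[1])):
        # A costlier machine survives only if it is strictly faster
        # than every cheaper machine already kept.
        if not front or perf > front[-1][1]:
            front.append((cost, perf))
    return front
```

A machine is dropped when some other machine costs no more and performs at least as well, so the returned list rises in both cost and performance.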


Work in Progress

• Merging step
– How can machine designs for several basic blocks be combined?
• Inaccurate cost model
– How can a more accurate estimate for the cost be developed?
• Space exploration (external/internal)
– Number of clusters and generic FUs are externally spacewalked
– Allowable performance increase is internally spacewalked
– What areas of this space exploration should be external/internal?
• Reprogrammability of designed machines


Conclusions

• Developed a cost-sensitive method for partitioning operations across clusters
• Used this partitioning to define an application-specific low-cost multicluster datapath architecture
• Average performance loss and cost savings for two and four cluster machines:

Machine Configuration    Performance Loss    Cost Savings
2-cluster                -5.4%               20.4%
4-cluster                -2.5%               28.0%


Questions?

http://cccp.eecs.umich.edu


Backup Slides


4-Cluster Cost Savings and Performance

[Figure: bar chart of Performance and Cost Savings per benchmark (channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, Average). Y-axis: Percentage Performance Loss / Cost Savings, from -20.0 to 70.0.]


Previous Work

• Datapath synthesis
– Cathedral-III: complete synthesis system from IMEC
– Paulin and Knight: force-directed scheduling
– Sehwa: designed processing pipelines from behavioral specs
– PICO: designed application-specific VLIW processors
• Bitwidth-sensitive datapath synthesis
– Valen-C: augmented C language to convey bitwidth information


Weight Calculation Phase

• Edge weights
– Assigns higher weight to edges likely to increase schedule length when cut
– Uses a slack distribution method to assign weights
• Node weights
– Assigns weights to each operation based on how much it is likely to affect the load of the FUs in the cluster
– Higher weights attributed to operations that can
• Not changed from traditional RHOP
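The idea behind the edge weights can be illustrated with a simplified stand-in. This is not RHOP's slack distribution method, which the slides do not detail. The sketch only computes per-edge slack from ASAP and ALAP times and gives zero-slack (critical) edges a high weight. The function name, the `critical_weight` value of 10, and the two-level weighting are assumptions.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple


def edge_weights_from_slack(latency: Dict[str, int],
                            edges: List[Tuple[str, str]],
                            critical_weight: int = 10) -> Dict[Tuple[str, str], int]:
    """Weight each dependence edge: heavy if it has no slack, 1 otherwise."""
    preds: Dict[str, List[str]] = {op: [] for op in latency}
    succs: Dict[str, List[str]] = {op: [] for op in latency}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)
    order = list(TopologicalSorter(preds).static_order())

    # Earliest start: after all predecessors have finished.
    asap: Dict[str, int] = {}
    for op in order:
        asap[op] = max((asap[p] + latency[p] for p in preds[op]), default=0)
    length = max((asap[op] + latency[op] for op in order), default=0)

    # Latest start that still meets the dependence-limited schedule length.
    alap: Dict[str, int] = {}
    for op in reversed(order):
        alap[op] = min((alap[s] for s in succs[op]), default=length) - latency[op]

    weights: Dict[Tuple[str, str], int] = {}
    for src, dst in edges:
        slack = alap[dst] - (asap[src] + latency[src])
        # Cutting a zero-slack edge adds delay directly to the critical path.
        weights[(src, dst)] = critical_weight if slack == 0 else 1
    return weights
```

For a chain a → b → c → d with an extra edge a → d and unit latencies, the three chain edges are critical and receive the high weight, while a → d has two cycles of slack and receives weight 1.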
