University of Michigan, Electrical Engineering and Computer Science
A Distributed Control Path Architecture for VLIW Processors
Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
Advanced Computer Architecture Laboratory, University of Michigan
*HP Laboratories
Motivation
• VLIW scaling problem
  ► Centralized resources
  ► Highly ported structures
  ► Wire delays
[Figure: VLIW block diagrams at two widths, each with function units (FU), a register file, and instruction fetch/decode]
Multicluster VLIW
• Distribute register files
• Cluster function units
• Distribute data caches
• Clusters communicate through an interconnection network
• Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
[Figure: two clusters (cluster 0, cluster 1), each with FUs and a register file, joined by an interconnection network and fed by one instruction fetch/decode unit]
Control Path Scaling Problem
• Larger I-cache
• Latency
  ► Long wires for control signal distribution
• Code compression
  ► Hardware cost, power
  ► Grow quadratically with the number of FUs
[Figure: compressed operations in the I-cache are fetched at the PC and routed through the align/shift network into the IR, with unused slots filled by NOPs]
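To make the role of the align/shift network concrete, here is a minimal Python sketch of expanding a compressed MultiOp into a full-width instruction register. The encoding is an assumption made for illustration (a slot bit mask plus the non-NOP operations); the slides do not specify the actual format, and `expand_multiop` is a hypothetical name.

```python
NOP = "NOP"

def expand_multiop(slot_mask, ops, num_slots):
    """Expand a compressed MultiOp into a full-width instruction register.

    slot_mask: bit i is set when FU slot i holds a real operation
               (hypothetical encoding, not taken from the slides).
    ops:       the non-NOP operations, in increasing slot order.
    """
    if slot_mask >> num_slots:
        raise ValueError("slot mask names a slot that does not exist")
    if bin(slot_mask).count("1") != len(ops):
        raise ValueError("slot mask and operation count disagree")
    remaining = iter(ops)
    ir = []
    for slot in range(num_slots):
        # Route the next stored operation to this slot, or fill it with a NOP.
        ir.append(next(remaining) if slot_mask & (1 << slot) else NOP)
    return ir

print(expand_multiop(0b0101, ["A", "B"], num_slots=4))  # ['A', 'NOP', 'B', 'NOP']
```

In hardware, steering any fetched operation to any slot is the job of the align/shift network, the structure whose cost the slide describes as growing quadratically with the number of FUs.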
Straightforward Approach
• Distribute instruction fetch in a spirit similar to the distribution of the data path
  ► Local communication of control signals
  ► Reduces latency, hardware cost, and power
• Used in Multiflow Trace 14/300 processors
[Figure: block diagram with I-cache, PC, and IR feeding clustered FUs and register files joined by an interconnection network]
DVLIW Approach
• Simple distribution has problems
  ► Doesn’t support code compression
  ► PC is still a centralized resource
[Figure: simple distribution (one PC) next to DVLIW, in which each cluster has its own PC (PC0, PC1), I-cache, align/shift logic, and IR; clusters are joined by an interconnection network]
DVLIW Execution Model
• Clusters execute in lock-step
  ► When one cluster stalls, all clusters stall
• Clusters collectively execute one thread
  ► Each cluster runs an instruction stream
  ► Compiler orchestrates the execution of streams
  ► Compiler manages communication
  ► Lightweight synchronization
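The lock-step rule can be sketched as a small cycle counter. This is a simplified model under stated assumptions: every cluster issues one MultiOp slice per cycle, and `stalls` is a hypothetical table of extra stall cycles (for instance an I-cache miss in one cluster). None of these names come from the slides.

```python
def lockstep_cycles(streams, stalls):
    """Count cycles for clusters that advance in lock-step.

    streams: one instruction stream (list of ops) per cluster, all the same length.
    stalls:  hypothetical map from (cluster, op index) to extra stall cycles.
    """
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("lock-step streams must have the same number of MultiOps")
    cycles = 0
    for i in range(length):
        # A stall in any one cluster holds every cluster at this MultiOp.
        worst = max(stalls.get((c, i), 0) for c in range(len(streams)))
        cycles += 1 + worst
    return cycles

streams = [["A1", "B1", "C1"], ["A2", "B2", "C2"]]
print(lockstep_cycles(streams, {(1, 1): 2}))  # 5
```

Cluster 0 never stalls in this example, yet it still waits the two extra cycles that cluster 1 spends on its second operation.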
DVLIW Benefits
• Completely decentralized architecture
  ► Distributed data path
  ► Distributed control path
• Supports arbitrary code compression
• Exploits ILP on a multicore-style system
  ► Good for embedded applications
  ► Low cost
  ► Compiler support
DVLIW Architecture
[Figure: four VLIW clusters (cluster 0 to cluster 3) around a banked L2. The detail of one cluster shows a PC with next-PC logic and a br_target input, an L1 I-cache, align/shift logic, an IR, register files, FUs, an L1 D-cache, and links to the banked L2 and to neighboring clusters.]
Code Organization
• Code for each cluster is consecutive in memory
• Operations in the same MultiOp are stored in different memory locations
• Each cluster computes its own next PC
[Figure: memory layout. Conventional VLIW (single PC): A1 A2 A3 A4 A5 B1 B2 B3 B4 … DVLIW: the stream addressed by PC0 holds A1 A2 A3 B1 B2 …; the stream addressed by PC1 holds A4 A5 B3 B4.]
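The layout on this slide can be reproduced with a short Python sketch that splits MultiOps into per-cluster streams and records where each MultiOp starts in every stream. The operation names and their cluster assignment follow the slide's figure; `split_by_cluster` and the (cluster, op) representation are assumptions made for illustration.

```python
def split_by_cluster(multiops, num_clusters):
    """Lay out MultiOps as one consecutive code stream per cluster.

    multiops: list of MultiOps, each a list of (cluster, op) pairs.
    Returns (streams, starts), where starts[c][m] is the offset of MultiOp m
    in cluster c's stream, i.e. the value that cluster's own PC must take.
    """
    streams = [[] for _ in range(num_clusters)]
    starts = [[] for _ in range(num_clusters)]
    for multiop in multiops:
        for c in range(num_clusters):
            starts[c].append(len(streams[c]))
        for cluster, op in multiop:
            streams[cluster].append(op)
    return streams, starts

multiops = [
    [(0, "A1"), (0, "A2"), (0, "A3"), (1, "A4"), (1, "A5")],  # MultiOp A
    [(0, "B1"), (0, "B2"), (1, "B3"), (1, "B4")],             # MultiOp B
]
streams, starts = split_by_cluster(multiops, num_clusters=2)
print(streams)  # [['A1', 'A2', 'A3', 'B1', 'B2'], ['A4', 'A5', 'B3', 'B4']]
print(starts)   # [[0, 3], [0, 2]]
```

MultiOp B starts at offset 3 in cluster 0's stream but at offset 2 in cluster 1's, which is why each cluster has to compute its own next PC.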
Branch Mechanism
• Maintain correct execution order
  ► All clusters transfer control in the same cycle
  ► All clusters branch to the same logical MultiOp
• Unbundled branch in HPL-PD
  ► PBR btr1, TARGET: each cluster specifies its own target
  ► CMPP pr0, (x>100)?: result is broadcast to all clusters
  ► BR btr1, pr0: replicated in each cluster
Branch Handling Example
Conventional VLIW:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  …
  br btr1, pr0
DVLIW, cluster 0:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  bcast pr0
  br btr1, pr0
DVLIW, cluster 1:
  …
  pbr btr1, BB2’
  …
  …
  br btr1, pr0
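The DVLIW branch sequence above can be mimicked in a few lines of Python. This is an illustrative model only: the `Cluster` class, the `bcast` helper, and all addresses are hypothetical, chosen to show that both clusters take the same decision while jumping to different local addresses (BB2 in cluster 0, BB2’ in cluster 1).

```python
class Cluster:
    """Toy model of one cluster's control state (hypothetical, for illustration)."""

    def __init__(self):
        self.pc = 0
        self.btr = {}  # branch target registers, local to this cluster
        self.pr = {}   # predicate registers, local to this cluster

    def pbr(self, btr, target):
        # Prepare-to-branch: each cluster records its own target address.
        self.btr[btr] = target

    def br(self, btr, pr, fallthrough):
        # The branch itself is replicated in every cluster.
        self.pc = self.btr[btr] if self.pr[pr] else fallthrough

def bcast(src, dests, pr):
    # Copy one predicate from the computing cluster to the others.
    for dest in dests:
        dest.pr[pr] = src.pr[pr]

def run_branch(x):
    c0, c1 = Cluster(), Cluster()
    c0.pbr("btr1", 40)         # BB2 in cluster 0's stream (made-up address)
    c1.pbr("btr1", 24)         # BB2' in cluster 1's stream (made-up address)
    c0.pr["pr0"] = x > 100     # cmpp executes only in cluster 0
    bcast(c0, [c1], "pr0")
    c0.br("btr1", "pr0", fallthrough=8)
    c1.br("btr1", "pr0", fallthrough=6)
    return c0.pc, c1.pc

print(run_branch(150))  # (40, 24)
print(run_branch(50))   # (8, 6)
```

Only the one-bit predicate crosses between clusters; the target addresses stay local to each cluster.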
Sleep Mode
• Idle blocks after distribution
• Put cluster into sleep mode
  ► Compiler managed
  ► Saves energy
  ► Reduces code size
• Mode change happens at block boundary
[Figure: per-cluster control flow for cluster 0 and cluster 1, with BR, SLEEP, and WAKE operations at block boundaries]
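A compiler pass for sleep mode might start from something like the sketch below. It is a guess at the bookkeeping only, not the authors' algorithm: it assumes a straight-line sequence of blocks, treats any block in which a cluster has no operations as idle, and reports mode changes at block boundaries. `plan_sleep` and its input format are hypothetical.

```python
def plan_sleep(block_ops):
    """Decide per-cluster modes for a straight-line sequence of blocks.

    block_ops: list of blocks; each block lists the op count for every cluster.
    Returns (modes, transitions). transitions holds (cluster, block, action),
    meaning the cluster changes mode on entry to that block.
    """
    num_clusters = len(block_ops[0])
    modes = [
        ["sleep" if block[c] == 0 else "active" for block in block_ops]
        for c in range(num_clusters)
    ]
    transitions = []
    for c in range(num_clusters):
        previous = "active"  # assume every cluster starts awake
        for b, mode in enumerate(modes[c]):
            if mode != previous:
                transitions.append((c, b, "SLEEP" if mode == "sleep" else "WAKE"))
            previous = mode
    return modes, transitions

modes, transitions = plan_sleep([[3, 2], [4, 0], [2, 1]])
print(modes[1])      # ['active', 'sleep', 'active']
print(transitions)   # [(1, 1, 'SLEEP'), (1, 2, 'WAKE')]
```

Which cluster actually issues the wake-up, and how branches inside a sleeping region are handled, are not modeled here.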
Experimental Setup
• Trimaran toolset
• Processor configuration
  ► 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR per cluster
  ► 16K L1 I-cache total
  ► Perfect data cache assumed
• Power model
  ► Verilog for instruction align/shift logic
  ► Wire model
  ► CACTI cache model
• 21 benchmarks from MediaBench and SPECINT2000
Change in Global Communication Bits
[Figure: bar chart with a log-scale y axis (1 to 1,000,000) labeled Traffic Ratio. Series: Icmove increase, Fetch reduction, Total reduction. X axis: MediaBench (rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec), SPECINT (164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf), and average.]
Normalized Energy Consumption on Control Path
Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
[Figure: per-benchmark bar chart of normalized energy consumption (y axis 0 to 1) for the MediaBench and SPECINT benchmarks and their average. Annotations on the chart: 40% saving, 67% saving, 80% saving, 21% saving.]
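The energy definition on this slide is a plain sum, and normalization divides by the baseline's total. The sketch below shows the arithmetic in Python. The component values are arbitrary placeholders, not measurements from the study, and the function names are hypothetical.

```python
def control_path_energy(align_shift, wire, icache):
    """Control path energy = align/shift logic energy + wire energy + I-cache energy."""
    return align_shift + wire + icache

def normalized_energy(design_total, baseline_total):
    """Energy of a design relative to the baseline (1.0 means no change)."""
    return design_total / baseline_total

# Placeholder component energies in arbitrary units (NOT data from the slides).
baseline = control_path_energy(align_shift=2.0, wire=2.0, icache=4.0)
design = control_path_energy(align_shift=1.0, wire=1.0, icache=4.0)

ratio = normalized_energy(design, baseline)
print(ratio)                              # 0.75
print(f"{(1 - ratio) * 100:.0f}% saving")  # 25% saving
```

A normalized value of 0.6 would correspond to the 40% saving quoted on the slide.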
Normalized Code Size
Baseline: conventional VLIW with compressed encoding
Traditional method (single PC): 7x increase
DVLIW: 40% increase
[Figure: per-benchmark bar chart of normalized code size (y axis 0 to 10) for the MediaBench and SPECINT benchmarks and their average. Series: Traditional Method, DVLIW.]
Result Summary
• DVLIW benefits
  ► Order-of-magnitude reduction in global communication
  ► 40% savings in control path energy
  ► 5x code size reduction vs. simple distribution
• Small overhead for ILP execution on CMP
  ► 3% increase in execution cycles
  ► 4% increase in I-cache stalls
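The 5x code size figure is consistent with the two numbers on the Normalized Code Size slide. A quick check in Python, assuming "7x increase" means seven times the baseline size and "40% increase" means 1.4 times the baseline size:

```python
def size_reduction(traditional_pct, dvliw_pct):
    """Ratio of two code sizes, both given as a percentage of the same baseline."""
    return traditional_pct / dvliw_pct

# Relative to the compressed conventional VLIW baseline (100%):
traditional_pct = 700  # traditional single-PC distribution, read as 7x baseline
dvliw_pct = 140        # DVLIW, 40% larger than baseline

print(size_reduction(traditional_pct, dvliw_pct))  # 5.0
```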
Conclusions
• DVLIW removes the last centralized resource in a multicluster VLIW
  ► Fully distributed control path
  ► Scalable architecture
• More energy efficient
• Stylized CMP architecture
  ► Exploits ILP
  ► Multiple instruction streams
  ► Compiler orchestrated
Thank You
• For more information
  ► http://cccp.eecs.umich.edu