A Distributed Control Path Architecture for VLIW Processors
![Page 1: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/1.jpg)
University of Michigan, Electrical Engineering and Computer Science
A Distributed Control Path Architecture for VLIW Processors
Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
Advanced Computer Architecture Laboratory, University of Michigan
*HP Laboratories
![Page 2: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/2.jpg)
Motivation
• VLIW Scaling Problem
► Centralized resource
► Highly ported structures
► Wire delays
[Figure: centralized VLIW drawn at two sizes; in each, one Instruction Fetch/Decode unit and one Register File serve all FUs.]
![Page 3: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/3.jpg)
Multicluster VLIW
• Distribute register files
• Cluster function units
• Distribute data caches
• Clusters communicate through an interconnection network
• Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
[Figure: Cluster 0 and Cluster 1, each with its own FUs and Register File, joined by an interconnection network and fed by one shared Instruction Fetch/Decode unit.]
![Page 4: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/4.jpg)
Control Path Scaling Problem
• Larger I-cache
• Latency
► Long wires for control signal distribution
• Code compression
► Hardware cost, power
► Grow quadratically with the number of FUs
[Figure: centralized fetch path with PC, I-cache, align/shift network, and IR; operations labeled A to G and X, with NOP slots.]
![Page 5: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/5.jpg)
Straightforward Approach
• Distribute I-fetch, in the same spirit as the distribution of the data path
► Local communication of controls
► Reduce latency, hardware cost, power
• Used in Multiflow Trace 14/300 processors
[Figure: clusters each with a local I-cache and IR next to their FUs and Register Files, joined by an interconnection network; labels include PC.]
![Page 6: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/6.jpg)
DVLIW Approach
• Simple distribution has problems
► Doesn’t support code compression
► PC still a centralized resource
[Figure: two designs side by side, each with per-cluster I-cache, IR, FUs, Register Files, and an interconnection network. One is labeled with a single PC; the other with PC0 and PC1 and per-cluster align/shift logic.]
![Page 7: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/7.jpg)
DVLIW Execution Model
• Clusters execute in lock-step
► When one cluster stalls, all clusters stall
• Clusters collectively execute one thread
► Each cluster runs an instruction stream
► Compiler orchestrates the execution of streams
► Compiler manages communication
► Lightweight synchronization
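The lock-step rule can be illustrated with a toy simulation. This is a minimal sketch, not the authors' simulator: the function name `run_lockstep`, the per-cluster stall sets, and the operation names are all hypothetical, and equal-length streams are assumed to keep the example short.

```python
def run_lockstep(streams, stalls):
    """Simulate clusters that advance together.

    streams: one list of operations per cluster (equal lengths assumed).
    stalls:  one set per cluster of global cycle numbers on which that
             cluster cannot issue (for example, a cache miss).
    Returns one row per cycle with what every cluster issued.
    """
    n_ops = len(streams[0])
    assert all(len(s) == n_ops for s in streams)
    trace = []
    cycle = 0
    index = 0  # shared position: all clusters sit at the same logical MultiOp
    while index < n_ops:
        if any(cycle in s for s in stalls):
            # One cluster stalls, so every cluster stalls this cycle.
            trace.append(["stall"] * len(streams))
        else:
            trace.append([s[index] for s in streams])
            index += 1
        cycle += 1
    return trace


trace = run_lockstep(
    streams=[["a0", "a1", "a2"], ["b0", "b1", "b2"]],
    stalls=[set(), {1}],  # hypothetical: cluster 1 stalls on cycle 1
)
for row in trace:
    print(row)
```

Because the stall on cycle 1 holds back both clusters, the two streams stay aligned on the same logical MultiOp without any per-operation handshake.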
![Page 8: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/8.jpg)
DVLIW Benefits
• Completely decentralized architecture
► Distributed data path
► Distributed control path
• Supports arbitrary code compression
• Exploiting ILP on multi-core style system
► Good for embedded applications
► Low cost
► Compiler support
![Page 9: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/9.jpg)
DVLIW Architecture
[Figure: DVLIW architecture. Four VLIW clusters (Cluster 0 to Cluster 3) connect to each other and to a banked L2. Labels inside a cluster: PC, Next PC, br_target, L1 I-Cache, align/shift, IR, Register Files, FUs, L1 D-Cache.]
![Page 10: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/10.jpg)
Code Organization
• Code for each cluster is consecutive in memory
• Operations in the same MultiOp stored in different memory locations
• Each cluster computes its own next PC
[Figure: memory layout comparison.
Conventional VLIW (single PC): A1 A2 A3 A4 A5 | B1 B2 B3 B4 | …
DVLIW (PC0): A1 A2 A3 | B1 B2 | …
DVLIW (PC1): A4 A5 | B3 B4]
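The layout contrast on this slide can be sketched in a few lines. This is an illustration only: the function names, the nested-list encoding of a MultiOp, and the choice to count the PC in operation slots are assumptions, not details from the talk.

```python
def conventional_layout(multiops):
    """One address space: all operations of a MultiOp are stored together."""
    memory = []
    for multiop in multiops:
        for cluster_ops in multiop:
            memory.extend(cluster_ops)
    return memory


def dvliw_layout(multiops, n_clusters):
    """One stream per cluster: each cluster's operations are consecutive."""
    streams = [[] for _ in range(n_clusters)]
    for multiop in multiops:
        for cluster, cluster_ops in enumerate(multiop):
            streams[cluster].extend(cluster_ops)
    return streams


def next_pcs(pcs, multiop):
    """Each cluster advances its own PC by the size of its own slice."""
    return [pc + len(ops) for pc, ops in zip(pcs, multiop)]


# MultiOp A: A1..A3 on cluster 0, A4..A5 on cluster 1.
# MultiOp B: B1..B2 on cluster 0, B3..B4 on cluster 1.
multiops = [
    [["A1", "A2", "A3"], ["A4", "A5"]],
    [["B1", "B2"], ["B3", "B4"]],
]

print(conventional_layout(multiops))
print(dvliw_layout(multiops, n_clusters=2))
print(next_pcs([0, 0], multiops[0]))
```

After MultiOp A the two PCs differ (3 and 2 in this encoding), which is why a single shared PC cannot address both streams once each is stored compactly.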
![Page 11: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/11.jpg)
Branch Mechanism
• Maintain correct execution order
► All clusters transfer control at the same cycle
► All clusters branch to the same logical multiop
• Unbundled branch in HPL-PD
[Figure: a Branch is unbundled into three operations.
PBR btr1, TARGET: each cluster specifies its own target
CMPP pr0, (x>100)?: broadcast to all clusters
BR btr1, pr0: replicated in each cluster]
![Page 12: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/12.jpg)
Branch Handling Example
Conventional VLIW:
…
pbr btr1, BB2
cmpp pr0, (x>100)?
…
br btr1, pr0

DVLIW, Cluster 0:
…
pbr btr1, BB2
cmpp pr0, (x>100)?
bcast pr0
br btr1, pr0

DVLIW, Cluster 1:
…
pbr btr1, BB2’
….
….
br btr1, pr0
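The DVLIW half of this example can be modeled as a toy program. It is a hedged sketch, not HPL-PD semantics: the `Cluster` class, the `bcast` helper, and every address are made up for illustration. Only the operation names (pbr, cmpp, bcast, br) and the register names come from the slide.

```python
class Cluster:
    def __init__(self, name):
        self.name = name
        self.pc = 0
        self.btr = {}  # branch target registers, local to this cluster
        self.pr = {}   # predicate registers, local to this cluster

    def pbr(self, btr, target):
        """Prepare-to-branch: record this cluster's own target address."""
        self.btr[btr] = target

    def br(self, btr, pr, fallthrough):
        """Take the branch if the predicate is set, else fall through."""
        self.pc = self.btr[btr] if self.pr[pr] else fallthrough


def bcast(source, pr, clusters):
    """Copy a predicate from the source cluster to every other cluster."""
    for cluster in clusters:
        if cluster is not source:
            cluster.pr[pr] = source.pr[pr]


def run_branch(x):
    c0, c1 = Cluster("cluster0"), Cluster("cluster1")
    c0.pbr("btr1", 0x100)       # BB2 in cluster 0's stream (made-up address)
    c1.pbr("btr1", 0x900)       # BB2' in cluster 1's stream (made-up address)
    c0.pr["pr0"] = x > 100      # cmpp runs on cluster 0 only
    bcast(c0, "pr0", [c0, c1])  # bcast pr0
    c0.br("btr1", "pr0", fallthrough=0x010)  # br is replicated
    c1.br("btr1", "pr0", fallthrough=0x810)  # in each cluster
    return c0.pc, c1.pc


print([hex(pc) for pc in run_branch(150)])
print([hex(pc) for pc in run_branch(5)])
```

Both clusters decide on the same predicate value, so they reach the same logical block even though the two target addresses differ.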
![Page 13: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/13.jpg)
Sleep Mode
• Idle blocks after distribution
• Put cluster into sleep mode
► Compiler managed
► Save energy
► Reduce code size
• Mode change happens at block boundary
[Figure: control-flow graphs for Cluster 0 and Cluster 1; blocks end in BR, with SLEEP and WAKE operations marking the mode changes.]
![Page 14: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/14.jpg)
Experimental Setup
• Trimaran toolset
• Processor configuration
► 4 clusters, 2 INT, 1 FP, 1 MEM, 1 BR per cluster
► 16K L1 I-cache total
► Perfect data cache assumed
• Power Model
► Verilog for instruction align/shift logic
► Wire model
► Cacti cache model
• 21 benchmarks from MediaBench and SPECINT2000
![Page 15: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/15.jpg)
Change in Global Communication Bits
[Figure: bar chart. Y-axis: Traffic Ratio, log scale from 1 to 1,000,000. Series: Icmove increase, Fetch reduction, Total reduction. X-axis benchmarks: MediaBench (rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec), SPECINT (164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf), and average.]
![Page 16: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/16.jpg)
Normalized Energy Consumption on Control Path
Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
[Figure: bar chart. Y-axis: Normalized Energy Consumption, 0 to 1. X-axis benchmarks: rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec, 164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf, average. Callouts on the chart: 40% saving, 67% saving, 80% saving, 21% saving.]
![Page 17: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/17.jpg)
Normalized Code Size
Baseline: Conventional VLIW with compressed encoding
Traditional method (single PC): 7x increase
DVLIW: 40% increase
[Figure: bar chart. Y-axis: normalized code size, 0 to 10. Series: Traditional Method, DVLIW. X-axis benchmarks: rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec, 164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf, average.]
![Page 18: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/18.jpg)
Result Summary
• DVLIW benefits
► Order of magnitude reduction in global communication
► 40% savings in control path energy
► 5x code size reduction vs. simple distribution
• Small overhead for ILP execution on CMP
► 3% increase in execution cycles
► 4% increase in I-cache stalls
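The 5x code size figure can be cross-checked against the Normalized Code Size slide. The sketch below rests on one assumption that the slides do not state explicitly: "7x increase" is read as a normalized size of 7.0 relative to the baseline, and "40% increase" as 1.4. The variable names are illustrative.

```python
from fractions import Fraction

baseline = Fraction(1)                  # conventional VLIW, compressed encoding
simple_distribution = Fraction(7)       # assumed reading of "7x increase"
dvliw = baseline * Fraction(140, 100)   # "40% increase" over the baseline

# Ratio of the single-PC distributed code size to the DVLIW code size.
reduction = simple_distribution / dvliw
print(reduction)  # 5
```

Under that reading the two slides agree: 7.0 / 1.4 = 5.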
![Page 19: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/19.jpg)
Conclusions
• DVLIW removes last centralized resource in a multicluster VLIW
► Fully distributed control path
► Scalable architecture
• More energy efficient
• Stylized CMP architecture
► Exploit ILP
► Multiple instruction streams
► Compiler orchestrated
![Page 20: A Distributed Control Path Architecture for VLIW Processors](https://reader036.fdocuments.us/reader036/viewer/2022081505/56815b01550346895dc8b383/html5/thumbnails/20.jpg)
Thank You
• For more information
► http://cccp.eecs.umich.edu