A Distributed Control Path Architecture for VLIW Processors

University of Michigan, Electrical Engineering and Computer Science


Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*

Advanced Computer Architecture Laboratory, University of Michigan

*HP Laboratories


Motivation

• VLIW Scaling Problem
► Centralized resource
► Highly ported structures
► Wire delays

[Figure: centralized VLIW, with many function units (FU) sharing a register file and a single instruction fetch/decode unit]


Multicluster VLIW

• Distribute register files
• Cluster function units
• Distribute data caches
• Clusters communicate through interconnection network

• Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC

[Figure: two clusters (Cluster 0, Cluster 1), each with its own FUs and register file, connected by an interconnection network]


Control Path Scaling Problem

• Larger I-cache
• Latency
► Long wires for control signal distribution
• Code compression
► Hardware cost, power
► Grow quadratically with the number of FUs

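[Figure: centralized control path. The PC indexes the I-cache, which holds compressed operations (A, B, C, ...); an align/shift network places them into the IR, with NOPs in the unused slots]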


Straightforward Approach

• Distribute I-fetch in spirit similar to distribution of data path
► Local communication of controls
► Reduce latency, hardware cost, power

• Used in Multiflow Trace 14/300 processors

[Figure: distributed instruction fetch, with an I-cache and IR placed locally beside each group of FUs]



DVLIW Approach

• Simple distribution has problems
► Doesn’t support code compression
► PC still a centralized resource

[Figure: simple distribution (one PC driving per-cluster I-caches and IRs) next to DVLIW (each cluster has its own PC (PC0, PC1), I-cache, align/shift logic, IR, FUs, and register file)]


DVLIW Execution Model

• Clusters execute in lock-step
► When one cluster stalls, all clusters stall

• Clusters collectively execute one thread
► Each cluster runs an instruction stream
► Compiler orchestrates the execution of streams
► Compiler manages communication
► Light weight synchronization
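
The lock-step rule above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the hardware described in the talk: the function name `lockstep_progress` and the representation of stalls as sets of cycle numbers are hypothetical choices made for the example.

```python
def lockstep_progress(stalls, total_cycles):
    """Count how many MultiOps each cluster completes under lock-step execution.

    stalls[c] is the (hypothetical) set of cycle numbers at which cluster c
    would stall, e.g. on a local I-cache miss.
    """
    progress = [0] * len(stalls)
    for cycle in range(total_cycles):
        # Lock-step rule: if any single cluster stalls, every cluster stalls.
        if any(cycle in cluster_stalls for cluster_stalls in stalls):
            continue
        for c in range(len(progress)):
            progress[c] += 1
    return progress


# Cluster 0 stalls at cycle 2; cluster 1 stalls at cycles 2 and 5.
print(lockstep_progress([{2}, {2, 5}], total_cycles=8))  # [6, 6]
```

Because every stall is global, all clusters always sit at the same logical MultiOp, which is what lets the compiler schedule inter-cluster communication statically.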


DVLIW Benefits

• Completely decentralized architecture
► Distributed data path
► Distributed control path

• Supports arbitrary code compression

• Exploiting ILP on multi-core style system
► Good for embedded applications
► Low cost
► Compiler support


DVLIW Architecture

[Figure: four VLIW clusters (Cluster 0 to Cluster 3) attached to a banked L2. Each cluster contains its own PC and next-PC logic with a branch target (br_target), an L1 I-cache, align/shift logic, an IR, register files, FUs, and an L1 D-cache, with links to the neighboring clusters]


Code Organization

• Code for each cluster is consecutive in memory

• Operations in the same MultiOp stored in different memory locations

• Each cluster computes its own next PC

[Figure: memory layout.
Conventional VLIW (single PC): A1 A2 A3 A4 A5 B1 B2 B3 B4 …
DVLIW, stream addressed by PC0: A1 A2 A3 B1 B2 …
DVLIW, stream addressed by PC1: A4 A5 B3 B4 …]
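
The layout on this slide can be sketched in Python as follows. This is an illustrative model only: the helper names `layout_conventional` and `layout_dvliw` are hypothetical, and the assignment of A1 to A3 and B1 to B2 to cluster 0 (with the rest on cluster 1) is read off the figure rather than stated in the text.

```python
# Each MultiOp is modeled as a list of (cluster_id, operation_name) pairs.
multiops = [
    [(0, "A1"), (0, "A2"), (0, "A3"), (1, "A4"), (1, "A5")],  # MultiOp A
    [(0, "B1"), (0, "B2"), (1, "B3"), (1, "B4")],             # MultiOp B
]


def layout_conventional(multiops):
    """Single stream: all operations of a MultiOp are adjacent in memory."""
    return [op for multiop in multiops for _, op in multiop]


def layout_dvliw(multiops, num_clusters):
    """One consecutive code region per cluster, each walked by its own PC."""
    streams = [[] for _ in range(num_clusters)]
    for multiop in multiops:
        for cluster, op in multiop:
            streams[cluster].append(op)
    return streams


conventional = layout_conventional(multiops)
streams = layout_dvliw(multiops, num_clusters=2)
print(conventional)  # ['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4']
print(streams)       # [['A1', 'A2', 'A3', 'B1', 'B2'], ['A4', 'A5', 'B3', 'B4']]
```

In the DVLIW layout, the pieces of MultiOp B start at different offsets in each stream (index 3 for cluster 0, index 2 for cluster 1), which is why each cluster needs its own next-PC computation.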


Branch Mechanism

• Maintain correct execution order
► All clusters transfer control at the same cycle
► All clusters branch to the same logical multiop

• Unbundled branch in HPL-PD

A branch is split into three operations:

PBR btr1, TARGET        (each cluster specifies its own target)
CMPP pr0, (x>100)?      (broadcast to all clusters)
BR btr1, pr0            (replicated in each cluster)


Branch Handling Example

Conventional VLIW:
…
pbr btr1, BB2
cmpp pr0, (x>100)?
…
br btr1, pr0

DVLIW, Cluster 0:
…
pbr btr1, BB2
cmpp pr0, (x>100)?
bcast pr0
br btr1, pr0

DVLIW, Cluster 1:
…
pbr btr1, BB2’
….
….
br btr1, pr0
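
A minimal Python sketch of the two-cluster branch sequence above. It is a toy model, not HPL-PD semantics: the `Cluster` class, the helper functions, and all numeric addresses (standing in for BB2, BB2’, and the fall-through blocks) are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    pc: int = 0
    btr: dict = field(default_factory=dict)  # branch target registers
    pr: dict = field(default_factory=dict)   # predicate registers


def pbr(cluster, btr_name, target):
    """Prepare-to-branch: each cluster records its own local target address."""
    cluster.btr[btr_name] = target


def cmpp(cluster, pr_name, condition):
    """Compare-to-predicate: computed once, on a single cluster."""
    cluster.pr[pr_name] = condition


def bcast(source, pr_name, clusters):
    """Broadcast the predicate so every cluster sees the same branch outcome."""
    for cluster in clusters:
        cluster.pr[pr_name] = source.pr[pr_name]


def br(cluster, btr_name, pr_name, fallthrough):
    """Replicated branch: taken or not taken identically on every cluster."""
    cluster.pc = cluster.btr[btr_name] if cluster.pr[pr_name] else fallthrough


def run_branch(x):
    c0, c1 = Cluster(), Cluster()
    pbr(c0, "btr1", 0x100)         # hypothetical address of BB2 in cluster 0's code
    pbr(c1, "btr1", 0x900)         # hypothetical address of BB2' in cluster 1's code
    cmpp(c0, "pr0", x > 100)
    bcast(c0, "pr0", [c0, c1])
    br(c0, "btr1", "pr0", fallthrough=0x040)
    br(c1, "btr1", "pr0", fallthrough=0x840)
    return c0.pc, c1.pc


print([hex(pc) for pc in run_branch(150)])  # ['0x100', '0x900']
print([hex(pc) for pc in run_branch(50)])   # ['0x40', '0x840']
```

Both clusters reach different physical addresses but the same logical block, because they share one predicate value and differ only in their locally prepared targets.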


Sleep Mode

• Idle blocks after distribution

• Put cluster into sleep mode

► Compiler managed
► Save energy
► Reduce code size

• Mode change happens at block boundary

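[Figure: block-level timeline for Cluster 0 and Cluster 1. Cluster 0 executes blocks ending in BR; Cluster 1 is marked SLEEP for its idle blocks and WAKE when it resumes, with the mode changes at block boundaries]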


Experimental Setup

• Trimaran toolset
• Processor configuration
► 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR per cluster
► 16K L1 I-cache total
► Perfect data cache assumed
• Power Model
► Verilog for instruction align/shift logic
► Wire model
► Cacti cache model

• 21 benchmarks from MediaBench and SPECINT2000


Change in Global Communication Bits

[Figure: bar chart of traffic ratio (log scale, 1 to 1,000,000) per benchmark. Series: Icmove increase, Fetch reduction, Total reduction. MediaBench: rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec. SPECINT: 164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf. Final bar: average]


Normalized Energy Consumption on Control Path

Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)

40% saving 67% saving 80% saving 21% saving
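
The energy formula above, and the normalization used in the chart, can be written out as a short Python sketch. The component values below are placeholders in arbitrary units, not measurements from the talk, and normalizing against a conventional-VLIW baseline is an assumption about how the chart was produced.

```python
def control_path_energy(align_shift, wire, icache):
    """Control path energy = align/shift logic energy + wire energy + I-cache energy."""
    return align_shift + wire + icache


def normalized_energy(design, baseline):
    """Energy of a design relative to the baseline (1.0 means no change)."""
    return control_path_energy(**design) / control_path_energy(**baseline)


# Placeholder numbers chosen only to make the arithmetic easy to follow.
baseline = {"align_shift": 2.0, "wire": 3.0, "icache": 5.0}
dvliw = {"align_shift": 1.0, "wire": 1.0, "icache": 3.0}

ratio = normalized_energy(dvliw, baseline)
saving = 1.0 - ratio
print(f"normalized energy = {ratio:.2f}, saving = {saving:.0%}")
```

A normalized value of 0.6 on the chart would correspond to a 40% saving under this definition.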

[Figure: bar chart of normalized control path energy consumption (y-axis, 0 to 1) for the 21 MediaBench and SPECINT2000 benchmarks and their average]


Normalized Code Size

Baseline: Conventional VLIW with compressed encoding
Traditional method (single PC): 7x increase
DVLIW: 40% increase

[Figure: bar chart of normalized code size (y-axis, 0 to 10) for the 21 MediaBench and SPECINT2000 benchmarks and their average. Series: Traditional Method, DVLIW]


Result Summary

• DVLIW benefits
► Order of magnitude reduction in global communication
► 40% savings in control path energy
► 5x code size reduction vs. simple distribution

• Small overhead for ILP execution on CMP
► 3% increase in execution cycles
► 4% increase in I-cache stalls
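
The 5x code size figure appears to follow from the two numbers on the Normalized Code Size slide. The arithmetic below assumes that "7x increase" means 7 times the baseline size, that "40% increase" means 1.4 times the baseline size, and that "simple distribution" refers to the traditional single-PC method; the symbols are introduced here for illustration.

```latex
\[
\frac{S_{\text{traditional}}}{S_{\text{DVLIW}}}
  = \frac{7 \times S_{\text{baseline}}}{1.4 \times S_{\text{baseline}}}
  = 5
\]
```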


Conclusions

• DVLIW removes last centralized resource in a multicluster VLIW
► Fully distributed control path
► Scalable architecture

• More energy efficient

• Stylized CMP architecture
► Exploit ILP
► Multiple instruction streams
► Compiler orchestrated


Thank You

• For more information
► http://cccp.eecs.umich.edu
