Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network...

Post on 17-Jan-2016


Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network

Hiroki Matsutani (Keio Univ, JAPAN)
Michihiro Koibuchi (NII, JAPAN)
Hideharu Amano (Keio Univ, JAPAN)

Introduction

• Network-on-Chips
  – Tile architecture
  – On-chip routers
  – Packet switching
• Various NoC topologies
  – Mesh, Torus
  – H-Tree, Fat Trees
• Fat H-Tree (FHT)
• Evaluations of FHT
  – Performance
  – Area
  – Energy

[Figure: a mesh-based on-chip network of 3x3 tiles numbered 0 to 8; each tile is a RISC, DSP, RAM, or I/O block]

We proposed FHT as an alternative to Fat Trees

NoCs’ topologies: Mesh & Torus

• 2-D Mesh
  – RAW [Taylor, IEEE Micro’02]
• 2-D Torus
  – 2x bandwidth of mesh

[Figure: 2-D mesh and 2-D torus; legend: Router, Core]

Fat H-Tree is a tree-based topology, but it includes a torus structure

NoCs’ topologies: Fat Trees

• Fat Tree (p, q, c)
  – p: # of upward links
  – q: # of downward links
  – c: # of core ports

[Figure: Fat Tree (2,4,2) and Fat Tree (2,4,1) with rank-1 and rank-2 routers; legend: Router, Core]

Trees are duplicated in Fat Trees and Fat H-Tree, but the connection patterns of trees are different!

Outline

• NoCs’ topologies
  – Mesh, Torus
  – H-Trees, Fat Trees
• Fat H-Tree (FHT)
  – Structure
  – 2-D layout
  – Routing algorithm (DTR)
• Evaluations of FHT
  – Network logic area
  – Energy consumption
  – Throughput

Fat H-Tree: Structure

• Fat H-Tree [Yamada, EUC’04]
  – Red Tree (H-Tree)
  – Black Tree (H-Tree)
• Combining two H-Trees (red & black)

[Figure: red and black H-Trees laid over the cores; legend: Router, Core]

The black tree is shifted toward the lower right of the red tree

By shifting the location of the black tree, the connection pattern of the trees differs from that of the original Fat Trees

Fat H-Tree is formed on red & black trees

Rank-2 and upper routers are omitted in this figure

Each core is connected to both red & black trees

A ring is formed with cores & rank-1 routers

Torus-level performance by combining only two H-Trees

Fat H-Tree: 2-D layout on VLSI

• Fat H-Tree
  – Torus structure: folded in the same way as the folded layout of a 2-D Torus

[Figure: Fat H-Tree with long feedback links across the chip, and the topologically equivalent folded 2-D layout of Fat H-Tree; legend: Router, Core]

Fat H-Tree: Routing algorithm

• Paths on a single H-tree
  – Only red tree, or
  – Only black tree

[Figure: example paths; only red tree: 6-hop; only black tree: 6-hop]

Fat H-Tree: Routing algorithm

• Paths on a single H-tree
  – Only red tree, or
  – Only black tree
• Paths across trees
  – Transit between trees
  – Minimum paths

[Figure: the red tree is used first, then the packet transits to the black tree; total 4-hop (minimum)]

Exploiting such paths is key to improving performance

Fat H-Tree: Dual tree routing (DTR)

• Dual tree routing
  – Transit trees for minimum paths
  – Cycles across trees
• Deadlock avoidance
  – VC# is increased when a packet transits from red to black

[Figure: VC#0 is used before the transit; VC#1 is used after the transit]

Only TWO VCs are sufficient in a 64-node FHT
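The deadlock-avoidance rule above can be sketched in a few lines. This is only an illustration, not the authors' router logic: the function name `assign_vcs` and the representation of a path as a list of tree colors are assumptions made for the example.

```python
def assign_vcs(trees):
    """Return the VC number used on each hop of a path.

    `trees` is a hypothetical path description: the tree ("red" or
    "black") that each hop travels on, in order.
    """
    vc = 0
    prev = None
    vcs = []
    for tree in trees:
        if prev == "red" and tree == "black":
            vc += 1  # red -> black transit: move to the next VC
        vcs.append(vc)
        prev = tree
    return vcs


print(assign_vcs(["red", "red", "black", "black"]))  # [0, 0, 1, 1]
```

A path that stays on one tree keeps VC#0 throughout; only a red-to-black transit raises the VC number.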

Outline

• NoCs’ topologies
  – Mesh, Torus
  – H-Trees, Fat Trees
• Fat H-Tree (FHT)
  – Structure
  – 2-D layout
  – Routing algorithm (DTR)
• Evaluations of FHT
  – Network logic area
  – Energy consumption
  – Throughput

Ideal throughput: Channel bisection

Bandwidth of FHT is much improved by the torus structure

        N=16   N=64   N=256   Formula ($N = 2^n \times 2^n$)
HT         4      4       4   $4$
FT1        8     16      32   $2^{n+1}$
FT2       16     32      64   $2^{n+2}$
FHT       24     40      72   $2^{n+2} + 8$
Mesh       8     16      32   $2^{n+1}$
Torus     16     32      64   $2^{n+2}$

FT1: Fat Tree(2,4,1) FT2: Fat Tree(2,4,2)

In the FHT formula, $2^{n+2}$ is due to the torus and $8$ is due to the two H-Trees
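As a quick check, the channel-bisection closed forms (HT: 4, FT1 and Mesh: 2^(n+1), FT2 and Torus: 2^(n+2), FHT: 2^(n+2) + 8, with N = 2^n x 2^n) can be evaluated for n = 2, 3, 4. The dictionary name `bisection` is just a label chosen for this sketch.

```python
# Channel bisection as a function of n, where N = 2^n x 2^n cores.
bisection = {
    "HT": lambda n: 4,
    "FT1": lambda n: 2 ** (n + 1),
    "FT2": lambda n: 2 ** (n + 2),
    "FHT": lambda n: 2 ** (n + 2) + 8,  # torus part + two H-Trees
    "Mesh": lambda n: 2 ** (n + 1),
    "Torus": lambda n: 2 ** (n + 2),
}

for name, formula in bisection.items():
    # n = 2, 3, 4 correspond to N = 16, 64, 256
    print(name, [formula(n) for n in (2, 3, 4)])
```

The FHT row comes out as 24, 40, 72, matching the values on the slide.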

Number of routers

Router count of FHT is less than Fat Tree(2,4,2)

        N=16   N=64   N=256   Formula ($N = 2^n \times 2^n$)
HT         5     21      85   $(4^n - 1)/3$
FT1        6     28     120   $(4^n - 2^n)/2$
FT2       12     56     240   $4^n - 2^n$
FHT       10     42     170   $2(4^n - 1)/3$
Mesh      16     64     256   $N$
Torus     16     64     256   $N$

FT1: Fat Tree(2,4,1) FT2: Fat Tree(2,4,2)

Note: the number of NIs is not considered. FHT requires 2-port NIs for red & black
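The router-count formulas (HT: (4^n - 1)/3, FT1: (4^n - 2^n)/2, FT2: 4^n - 2^n, FHT: 2(4^n - 1)/3, Mesh and Torus: N = 4^n) can be checked the same way. The sketch below, with the assumed name `routers`, also confirms the claim that FHT needs fewer routers than Fat Tree(2,4,2) at each size.

```python
# Router count as a function of n, where N = 4^n cores.
routers = {
    "HT": lambda n: (4 ** n - 1) // 3,
    "FT1": lambda n: (4 ** n - 2 ** n) // 2,
    "FT2": lambda n: 4 ** n - 2 ** n,
    "FHT": lambda n: 2 * (4 ** n - 1) // 3,  # two H-Trees
    "Mesh": lambda n: 4 ** n,
    "Torus": lambda n: 4 ** n,
}

for n in (2, 3, 4):
    fht, ft2 = routers["FHT"](n), routers["FT2"](n)
    print(f"N={4 ** n}: FHT={fht}, FT2={ft2}, FHT < FT2: {fht < ft2}")
```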

Network logic area (routers & NIs)

• Synthesis of NoC
  – 16-core, 64-core
  – Design Compiler
  – 0.18um CMOS
• Router architecture
  – 1-flit = 32-bit
  – 4-stage pipeline
  – Wormhole, 2VCs
• NI architecture
  – In: 2-flit FIFO
  – Out: 2-flit FIFO

[Figure: wormhole router with input ports, buffers (2VCs), and a crossbar]

FHT’s NI is implemented as a “router” to forward packets between trees

Network logic area: 16/64-core

[Figure: synthesis result (16-core) and synthesis result (64-core)]

Network logic area of FHT is smaller than Fat Tree(2,4,2)

FHT’s NI is larger than others

Total wire length of all links

• Total unit-length of links
  – Core-to-router links
  – Router-to-router links
• 1-unit = distance between neighboring cores

How many unit-links would FHT require?

        N=16   N=64   N=256   Formula ($N = 2^n \times 2^n$)
HT        24    112     480   $2N(2^n - 1)/2^n$
FT1       32    192   1,024   $nN$
FT2       64    384   2,048   $2nN$
FHT       72    392   1,800   $8N(2^{n-1} - 1)/2^{n-1} + 8$
Mesh      24    112     480   $2(N - 2^n)$
Torus     48    224     960   $4(N - 2^n)$

FT1: Fat Tree(2,4,1) FT2: Fat Tree(2,4,2)

Wire length of FHT is almost the same as Fat Tree(2,4,2)
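To see how close FHT and Fat Tree(2,4,2) are, the total wire length in unit-links can be computed from closed forms that reproduce the table: FT2 = 2nN and FHT = 8N(2^(n-1) - 1)/2^(n-1) + 8, with N = 4^n. The function names below are assumptions for this sketch.

```python
def wire_ft2(n):
    """Total unit-links of Fat Tree(2,4,2) for N = 4^n cores."""
    return 2 * n * 4 ** n


def wire_fht(n):
    """Total unit-links of Fat H-Tree for N = 4^n cores."""
    half = 2 ** (n - 1)
    return 8 * 4 ** n * (half - 1) // half + 8


for n in (2, 3, 4):
    ft2, fht = wire_ft2(n), wire_fht(n)
    print(f"N={4 ** n}: FT2={ft2}, FHT={fht}, FHT/FT2={fht / ft2:.2f}")
```

The ratio stays near 1 for all three sizes; at N=256 the FHT total is actually the smaller of the two.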

Energy: NoC’s energy model

• Ave. flit energy $E_{flit}$
  – Send 1-flit to dest.
  – How much energy [J]?
• Parameters
  – 12mm square chip
  – 16/64-core
  – 0.18um CMOS
• Switching energy $E_{sw}$
  – 1-bit switching @ router
  – Gate-level sim
  – 1.88 [pJ / hop] for routers
  – 1.27 [pJ / hop] for NI
  – 1.45 [pJ / hop] for NI(fht)
• Link energy $E_{link}$
  – 1-bit transfer @ link
  – 0.67 [pJ / mm]

$E_{flit} = w \, H_{ave} \, (E_{sw} + E_{link})$ [Wang, DATE’05]
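The flit-energy model E_flit = w * H_ave * (E_sw + E_link) can be tried out numerically. The sketch below is a simplification, not the evaluation behind the slides: it assumes every hop costs the router switching energy and that every link has the same length `link_mm`, and both the function name and that parameter are hypothetical. Here w is taken to be the flit width in bits, since the slide gives per-bit energies.

```python
def flit_energy_pj(w_bits, h_ave, e_sw_pj, e_link_pj_per_mm, link_mm):
    """Average energy (pJ) to deliver one flit.

    Assumes a uniform link length `link_mm` for every hop, which the
    slides do not state; real link lengths differ per topology.
    """
    e_link_pj = e_link_pj_per_mm * link_mm  # per-bit energy of one link
    return w_bits * h_ave * (e_sw_pj + e_link_pj)


# Illustrative inputs: 32-bit flit, 3.20 average hops, 1.88 pJ/hop
# switching, 0.67 pJ/mm link energy, 3.0 mm links (assumed uniform).
print(flit_energy_pj(32, 3.20, 1.88, 0.67, 3.0))
```

With these inputs the result is roughly 398 pJ per flit; it is an illustration of the formula only and should not be read as a figure from the evaluation.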

Energy consumption: 16/64-core

[Figure: simulation result (16-core) and simulation result (64-core)]

Energy consumption of FHT is less than Fat Tree(2,4,2)

Throughput: Simulation environment

• Flit-level simulation
  – Throughput / latency
  – 16/64-core
• Topology (routing)
  – Mesh, Torus (DOR)
  – Fat Trees (up/down)
  – Fat H-Tree (DTR)
• Traffic patterns
  – Uniform
  – NAS Parallel Benchmark: BT.W, SP.W, CG.W, MG.W, IS.W

Packet size    16-flit (1-flit header)
Buffer size    1-flit per channel
Switching      Wormhole
# of VCs       2
Latency        3-cycle per 1-hop

FHT vs. FTs: Uniform (16/64-core)

[Figure: Uniform (16-core) and Uniform (64-core); curves: FHT (DTR), Fat Tree(2,4,2), Fat Tree(2,4,1)]

FHT outperforms FT2 in 16-core, but it doesn’t in 64-core

FHT(DTR) causes congestion around root of trees

FHT vs. FTs: BT (16/64-core)

[Figure: BT traffic (16-core) and BT traffic (64-core); curves: FHT (DTR), Fat Tree(2,4,2), Fat Tree(2,4,1)]

BT has neighboring communications. Advantage for FHT(DTR)

FHT(DTR) doesn’t cause congestion around roots

FHT vs. FTs: MG (16/64-core)

[Figure: MG traffic (16-core) and MG traffic (64-core); curves: FHT (DTR), Fat Tree(2,4,2), Fat Tree(2,4,1)]

Performance: FHT(DTR) > FT2 > FT1

Summary: Evaluations of FHT

• Performance
  – FHT outperforms Fat Tree (FT2), except for uniform
• Network logic area
  – FHT requires 20.5%-28.1% smaller area than FT2
• Energy consumption
  – FHT requires 6.7%-7.0% less energy than FT2
• Wire length
  – Wire length of FHT is almost the same as FT2
• Ongoing works
  – Evaluation in 90nm CMOS
  – 3-D layout of FHT for 3-D NoCs

[Figure: three stacked wafers (stacked ICs)]

Thank you for your attention

Feasibility of Fat H-Tree

• Total wire length
  – Slightly longer than Fat Trees
  – But a lot of wire resources are available on-chip
• Wire delay
  – Length of the longest wire is same as Fat Trees

[Figure: Fat Tree (2,4,1) and Fat H-Tree]

If Fat Trees are feasible, Fat H-Tree can be implemented with smaller area but higher performance

Routings for FHT: Torus routing (TOR)

• Single tree (STR)
  – Select a single tree per packet
  – Can’t transit trees
• Dual tree (DTR)
  – Transit trees for minimal paths
  – VCs are needed
• Torus routing (TOR)
  – Use torus formed with rank-1 routers & cores
  – VCs are needed
  – Can’t use rank-2 or upper routers

[Figure: Fat H-Tree’s torus structure]

TOR avoids congestion around roots, but its paths are non-minimal

FHT vs. Torus: Uniform (16/64-core)

• FHT (DTR): minimum routing using links around roots
• FHT (TOR): using torus structure (can’t use links around roots)
• 2-D Torus
• 2-D Mesh

[Figure: Uniform (16-core) and Uniform (64-core)]

FHT achieves torus-level throughput using only torus structure

Number of VCs in Dual Tree Routing

• # of VCs required is $H_{max}/4 + 1$
  – $H_{max}$ is the longest hop count in the network
• E.g.,
  – 16-core FHT requires 2VCs
  – 64-core FHT requires 2VCs
  – …

VC# is increased when a packet transits from red to black

Two VCs is not so costly…

NIs in Fat H-Tree

• Implemented as a “simplified router”
  – Connecting red & black trees
• Routing @ NI is simple
  – Forward packets to another tree if dst is not me

[Figure: Fat H-Tree NI; a crossbar connects the processing core, the port for the red tree, and the port for the black tree]
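The NI routing rule ("forward packets to another tree if dst is not me") is small enough to write out. The following is a minimal sketch under assumed names: `ni_forward`, its parameters, and the string port labels are hypothetical and not taken from the authors' design.

```python
def ni_forward(dst, my_id, arrived_on):
    """Pick the output port of an FHT network interface.

    `arrived_on` is the tree ("red" or "black") the packet came in on.
    Returns "core" to deliver locally, else the name of the other tree.
    """
    if dst == my_id:
        return "core"  # packet has reached its destination
    # Not for this core: hand the packet over to the other tree.
    return "black" if arrived_on == "red" else "red"


print(ni_forward(dst=5, my_id=5, arrived_on="red"))  # core
print(ni_forward(dst=9, my_id=5, arrived_on="red"))  # black
```

Because the decision depends only on the destination ID and the arrival port, the NI can stay much simpler than a full router.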


Average hop count

FHT offers shorter average hop count than Fat Trees

• Fat H-Tree
  – Minimum routing (DTR)

        routing   N=16   N=64   N=256
FT      up/down   3.60   5.43    7.36
FHT     DTR       3.20   4.84    6.78
Mesh    DOR       2.67   5.33   10.67
Torus   DOR       2.13   4.06    8.03

FT: Fat Trees

$H_{ave} = \frac{1}{N^2 - N} \sum_{x, y \in N} H(x, y)$
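The average hop count is the mean of H(x, y) over all N^2 - N ordered pairs of distinct nodes. For the mesh and torus rows this is easy to reproduce by brute force, assuming that dimension-order routing takes a shortest path so that H(x, y) equals the per-dimension distance sum; the function name `average_hops` is made up for this sketch, and the tree topologies are not covered.

```python
from itertools import product


def average_hops(k, wrap):
    """Average hop count of a k x k mesh (wrap=False) or torus (wrap=True)."""
    nodes = list(product(range(k), repeat=2))

    def dist(a, b):
        d = abs(a - b)
        return min(d, k - d) if wrap else d  # torus may go the short way round

    total = sum(
        dist(x1, x2) + dist(y1, y2)
        for (x1, y1) in nodes
        for (x2, y2) in nodes
    )
    n = len(nodes)
    return total / (n * n - n)  # self-pairs add 0 hops and are excluded


for k in (4, 8, 16):
    mesh, torus = average_hops(k, False), average_hops(k, True)
    print(f"N={k * k}: mesh={mesh:.2f}, torus={torus:.2f}")
```

This prints 2.67 / 2.13 for N=16, 5.33 / 4.06 for N=64, and 10.67 / 8.03 for N=256, in agreement with the Mesh and Torus rows of the table.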

Wire length of links

• Case studies
  – 16-core (1-unit = 3.0mm)
  – 64-core (1-unit = 1.5mm)
• Flit-width = 32-bit @ 12mm square chip

Utilization rate of wire resources in 2 metal layers (%)

        N=16    N=64
HT      1.6%    3.7%
FT1     2.1%    6.4%
FT2     4.3%   12.8%
FHT     4.8%   13.1%
Mesh    1.6%    3.7%
Torus   3.2%    7.5%

Wire length of FHT is almost the same as Fat Tree(2,4,2)

Routings for FHT: Single tree (STR)

• Single tree (STR)
  – Select a single tree per packet
  – Can’t transit trees
• Dual tree (DTR)
  – Transit trees for minimal paths
  – VCs are needed
• Torus routing (TOR)
  – Use torus formed with rank-1 routers & cores
  – VCs are needed

[Figure: Case 1: red tree, 6-hop; Case 2: black tree, 4-hop]

Routings for FHT: Dual tree (DTR)

• Single tree (STR)
  – Select a single tree per packet
  – Can’t transit trees
• Dual tree (DTR)
  – Transit trees for minimal paths
  – VCs are needed
• Torus routing (TOR)
  – Use torus formed with rank-1 routers & cores
  – VCs are needed

[Figure: the red tree is used first, then the black tree]

VC# is increased when a packet transits from red to black

Fat H-Tree: Structure

• Fat H-Tree [Yamada, EUC’04]
  – Red Tree (H-Tree)
  – Black Tree (H-Tree)
• Combining two H-Trees (red & black)

[Figure: red and black H-Trees laid over the cores; legend: Router, Core]

Both edges are connected (folded)

By shifting and folding the black tree, the connection pattern of the trees differs from that of the original Fat Trees