Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

Department of Computer Science and Engineering

Texas A&M University

Lei Wang - NOCS 2009

Multi-Core Wave & Networks-On-Chip

Uniprocessors hit the power wall. Multi-processors provide high performance at a lower power budget.

Shared-bus architectures have limited scalability. Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.

MIT Raw (0.18um, 300MHz): 16-core chip, four 4x4 mesh networks

Intel Polaris (65nm, 4GHz): 80-core chip, 8x10 mesh network


Challenges in On-Chip Communication

High performance: Low communication latency is critical for high system performance.

Bandwidth efficiency: Well-designed routing algorithms provide high network throughput.

Power and area constraints: Simple topologies and slim routers reduce communication power consumption and save chip area.

Efficient multicast support: Cache coherence protocols rely heavily on multicast or broadcast communication.

We propose a bandwidth-efficient routing algorithm for multicast communication in NOCs with low latency and power consumption.


Prior Work in Multicast Communication

Routing Evaluation Criteria for Multicast Communication [Ni93]: multicast in multicomputer systems

Tree-based Multicast Routing for DSM Multiprocessor [Torrellas96]: short-message multicast in DSM systems

Virtual Circuit Tree Multicasting for NOCs [Lipasti08]: demonstrates the necessity of on-chip multicasting and proposes table-based multicast routing

Region-based Multicast for CMPs [Duato08]: multicast routing for irregular topologies in CMPs


Outline

Motivation

Multicast Router Design
  State-of-art Unicast Router Architecture
  Replication Schemes
  Destination List Management

Recursive Partitioning Multicast (RPM)
  Network Partitioning
  Routing Rules
  Example
  Deadlock Avoidance

Evaluation

Conclusion


Different Bandwidth Usage Example

The left path requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.

The right path requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.

[Figure: two 4x4 mesh networks with nodes numbered 0-15, one showing the left path and one showing the right path; legend: Source, Destination]
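
The effect of path choice on link usage can be sketched with a small count on a 4x4 mesh. The example below is an illustration only: the source (node 9), the destinations (nodes 0 to 3), the dimension-order routes, and the function names are assumptions, not taken from the figure, whose node choices are not recoverable from the text. It builds two multicast trees to the same destinations, one from X-then-Y routes and one from Y-then-X routes, and counts the distinct links each tree uses.

```python
def route(src, dst, width=4, order="xy"):
    """Return the directed links (a, b) on a dimension-order route.

    Nodes are numbered row-major, so node = row * width + column.
    order="xy" travels along the row first, order="yx" along the column first.
    """
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    links = []
    for axis in order:
        if axis == "x":
            while x != dx:
                nx = x + (1 if dx > x else -1)
                links.append((y * width + x, y * width + nx))
                x = nx
        else:
            while y != dy:
                ny = y + (1 if dy > y else -1)
                links.append((y * width + x, ny * width + x))
                y = ny
    return links


def tree_links(src, dsts, width=4, order="xy"):
    """Count distinct links when one packet is replicated where routes diverge."""
    used = set()
    for d in dsts:
        used.update(route(src, d, width, order))
    return len(used)


source, destinations = 9, [0, 1, 2, 3]  # hypothetical nodes
print(tree_links(source, destinations, order="xy"))  # 11
print(tree_links(source, destinations, order="yx"))  # 5
```

Both trees reach the same four destinations, yet the second uses fewer than half the links, because the packet stays unreplicated until it reaches the destination row.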


State-of-Art Wormhole Unicast Router

[Figure: wormhole router with Input 0 to Input 4 and Output 0 to Output 4, input buffers holding VC 1 to VC n, Route Computation, VC Allocator, Switch Allocator, and crossbar switch. Pipeline timelines across Router and Link: RC VA SA ST LT, and RC/VA/SA ST LT.]

RC: Route Computation; VA: VC Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal


What Do We Need in a Multicast Router?

Packet Replication
  Synchronous Replication
  Asynchronous Replication

Destination List Management
  All-destination Encoding
  Bit String Encoding
  Multiple-region Broadcast Encoding
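
Of the three encodings, bit string encoding is the easiest to sketch: the packet header carries one bit per node, and a set bit marks that node as a destination. The following is a minimal illustration under that assumption. The function names and the 16-node network size are hypothetical, and the slides do not specify the actual header layout.

```python
def encode_destinations(dests, num_nodes=16):
    """Pack a collection of destination node IDs into a bit string (as an int)."""
    bits = 0
    for d in dests:
        if not 0 <= d < num_nodes:
            raise ValueError(f"node {d} is outside the {num_nodes}-node network")
        bits |= 1 << d
    return bits


def decode_destinations(bits, num_nodes=16):
    """List the node IDs whose bit is set."""
    return [n for n in range(num_nodes) if (bits >> n) & 1]


def remove_destination(bits, node):
    """Clear one node's bit, for example after the packet is delivered there."""
    return bits & ~(1 << node)


header = encode_destinations([0, 3, 5])
print(format(header, "016b"))                        # 0000000000101001
print(decode_destinations(header))                   # [0, 3, 5]
print(decode_destinations(remove_destination(header, 3)))  # [0, 5]
```

The header size grows with the number of nodes rather than the number of destinations, which is the usual trade-off of this encoding.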


Synchronous Replication

Packet replication happens at the switch traversal stage.

[Figure: router with Input 0 to Input 3 and Output 0 to Output 3; a packet of flits H, M, M, T shown over cycles 0 to 3; legend: H = head flit, M = middle flit, T = tail flit]


Asynchronous Replication

[Figure: router with Input 0 to Input 3 and Output 0 to Output 3; a packet of flits H, M, M, T shown over cycles 0 to 3; legend: H = head flit, M = middle flit, T = tail flit]


Network Partitioning

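[Figure: panels showing the network divided around the source node, with compass directions N, S, E, W and numbered parts. Panel titles: Eight Parts; Three Parts (5, 6, 7); Three Parts (0, 1, 7); Three Parts (3, 4, 5); Three Parts (1, 2, 3)]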
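
The partitioning step can be sketched as a function that groups destinations by their position relative to the current node. The part numbering used here (0 = north-east, then counter-clockwise: 1 = north, 2 = north-west, 3 = west, 4 = south-west, 5 = south, 6 = south-east, 7 = east) is an assumption. It was chosen because it reproduces the three-part cases named in the figure when the source sits in a mesh corner. Row-major node numbering with node 0 at the top-left is also assumed, and the function names are hypothetical.

```python
def partition(current, dests, width=8):
    """Group destinations into parts 0..7 by position relative to `current`."""
    cx, cy = current % width, current // width
    parts = {p: [] for p in range(8)}
    for d in dests:
        if d == current:
            continue  # the current node itself belongs to no part
        x, y = d % width, d // width
        north, south = y < cy, y > cy
        east, west = x > cx, x < cx
        if north:
            p = 0 if east else 2 if west else 1
        elif south:
            p = 6 if east else 4 if west else 5
        else:
            p = 7 if east else 3
        parts[p].append(d)
    return parts


def non_empty_parts(current, width=4):
    """Parts that contain at least one node when every node is a destination."""
    parts = partition(current, range(width * width), width)
    return [p for p, nodes in parts.items() if nodes]


# Corner sources of a 4x4 mesh see only three parts each.
print(non_empty_parts(0))    # top-left:     [5, 6, 7]
print(non_empty_parts(12))   # bottom-left:  [0, 1, 7]
print(non_empty_parts(3))    # top-right:    [3, 4, 5]
print(non_empty_parts(15))   # bottom-right: [1, 2, 3]
print(non_empty_parts(5))    # interior:     [0, 1, 2, 3, 4, 5, 6, 7]
```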


Basic Routing Rules

[Figure: mesh with Source and Destination nodes, compass directions N, S, E, W, and regions labeled NE and SW]

North: top-right corner. West: top-left corner. South: bottom-left corner. East: bottom-right corner.
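
One way to read the rule above is that each output direction also serves one corner region: North carries destinations in the top-right region, West the top-left, South the bottom-left, and East the bottom-right. The sketch below applies that reading recursively at every hop. It is an interpretation of the slide, not a confirmed implementation: the function names, the row-major node numbering with node 0 at the top-left, and the example nodes are all assumptions.

```python
STEP = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


def basic_route(current, dests, width=8):
    """Map each output port to the destinations its packet copy carries."""
    cx, cy = current % width, current // width
    ports = {}
    for d in dests:
        x, y = d % width, d // width
        if (x, y) == (cx, cy):
            port = "Eject"                # deliver to the local node
        elif y < cy and x >= cx:
            port = "N"                    # north and top-right region
        elif x < cx and y <= cy:
            port = "W"                    # west and top-left region
        elif y > cy and x <= cx:
            port = "S"                    # south and bottom-left region
        else:
            port = "E"                    # east and bottom-right region
        ports.setdefault(port, []).append(d)
    return ports


def multicast_links(current, dests, width=8):
    """Apply basic_route at every hop and collect the links that are used."""
    links = set()
    for port, group in basic_route(current, dests, width).items():
        if port == "Eject":
            continue
        dx, dy = STEP[port]
        nxt = current + dx + dy * width
        links.add((current, nxt))
        links |= multicast_links(nxt, group, width)  # re-partition at next node
    return links


print(basic_route(9, [0, 1, 2, 3], width=4))   # {'W': [0], 'N': [1, 2, 3]}
print(len(multicast_links(9, [0, 1, 2, 3], width=4)))  # 7
```

Every hop moves a packet copy one step closer to all destinations it carries, so the recursion always terminates. In this hypothetical example the copy for node 0 leaves through the West port on its own, which is the kind of extra replication that optimized rules would aim to avoid.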


Optimized Routing Rules

[Figure: mesh with Source and Destination nodes; annotation: Deadlock!!!]


RPM Example-step 1

[Figure: mesh showing multicast packets (M); legend: Source, Destination, Multicast Packet, Partitioning]


RPM Example-step 2

[Figure: mesh showing three multicast packets (M) and one Ejection]



RPM Example-step 3

[Figure: mesh showing three multicast packets (M)]



RPM Example-step 4

[Figure: mesh showing four multicast packets (M) and three Ejections]



RPM Example-step 5

[Figure: mesh showing two multicast packets (M) and one Ejection]



Deadlock Avoidance

RPM has no turn restrictions, which can introduce deadlock. We use virtual networks (VNs) to avoid deadlock.

Two VNs share the same physical network. The virtual channels of each port are divided equally between the two virtual networks. The virtual network ID (0 or 1) of each packet is decided at the source.

[Figure: two 4x4 meshes with nodes numbered 0-15, labeled Virtual Network 0 and Virtual Network 1]
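
The equal division of virtual channels between the two virtual networks can be sketched as follows. The contiguous assignment (lower half of the VC indices to VN 0, upper half to VN 1) and the function name are assumptions for illustration. The slides state only that the split is equal and that the VN ID is fixed at the source, not how the source chooses it.

```python
def vcs_for_virtual_network(vn_id, vcs_per_port=4, num_vns=2):
    """Return the VC indices at each port that a packet in `vn_id` may use."""
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs per port must divide equally among the VNs")
    if not 0 <= vn_id < num_vns:
        raise ValueError(f"virtual network ID must be in 0..{num_vns - 1}")
    share = vcs_per_port // num_vns
    return list(range(vn_id * share, (vn_id + 1) * share))


print(vcs_for_virtual_network(0))  # [0, 1]
print(vcs_for_virtual_network(1))  # [2, 3]
```

With 4 VCs per port, each virtual network gets two VCs, and a packet never leaves the VC set of the virtual network it was assigned at the source.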


Evaluation Methodology

Performance model: cycle-accurate network simulator that models all router pipeline stages in detail and is highly parameterized

Power model: Orion, with both dynamic and leakage power models

Network configuration

Topology                        8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
Routing                         RPM
VC/Port                         4
VC Depth                        4
Packet Length (flits)           4
Unicast Traffic Pattern         Uniform Random (Bit Complement, Transpose)
Multicast Packet Portion        10% (5%, 20%, 40%, 80%)
Multicast Destination Number    0-16 (uniformly distributed)


Uniform Random Traffic

Latency improves by around 50% before network saturation. Network throughput is extended by 40%.

[Figure: latency (cycles) vs. injection rate (flits/cycle/core); series: RPM, Mul unicast, VCTM(20%), VCTM(40%), VCTM(80%); annotations: 50%, 40%, 40%]


Link Utilization

[Figure: link utilization (op/cycle) vs. injection rate (flits/cycle/core); series: RPM, VCTM(20%), VCTM(40%), VCTM(80%); annotations: 33%, 45%]

Under low workload, RPM reduces link utilization by 33%. Under high workload, RPM reduces link utilization by 45%.


Dynamic Power Consumption

[Figure: dynamic power (W) vs. injection rate (flits/cycle/core, 0.01 to 0.5); paired RPM and VCTM bars broken down into Buffer, VC Arbiter, SW Arbiter, Xbar, and Link; annotations: 40%, 50%]


Scalability Study-Network Size

[Figure: latency (cycles) vs. network size (6×6, 8×8, 10×10, 16×16); series: RPM, VCTM; annotation: Over 50%]


Scalability Study-Multicast Traffic Portion

[Figure: latency (cycles) vs. portion of multicast traffic (5%, 10%, 20%, 40%, 80%, 100%); series: RPM, VCTM]


Scalability Study-Destination Number

[Figure: latency (cycles) vs. maximum number of destinations (4, 8, 16, 32); series: RPM, VCTM]


Conclusion

Propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM)
  Bandwidth-efficient and scalable

Performance Improvement
  Up to 50% latency reduction
  33% link utilization reduction

Power Savings
  Up to 40% total dynamic power savings
  25% crossbar and link power savings


Thank you!


Backup


Hardware Implementation of Routing logic


Bit Complement Traffic

[Figure: latency (cycles) plotted over x-axis ticks from 0.01 to 0.15; series: RPM, Mul unicast, VCTM (20%), VCTM (40%), VCTM (80%)]


Transpose Traffic

[Figure: latency (cycles) plotted over x-axis ticks from 0.01 to 0.15; series: RPM, Mul unicast, VCTM (20%), VCTM (40%), VCTM (80%)]
