Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
Department of Computer Science and Engineering
Texas A&M University
Lei Wang - NOCS 2009 2
Multi-Core Wave & Networks-On-Chip
Uniprocessors hit the power wall.
Multi-processors provide high performance at a lower power budget.
Shared-bus architectures have limited scalability.
Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.
MIT Raw (0.18 um, 300 MHz): 16-core chip, four 4x4 mesh networks
Intel Polaris (65 nm, 4 GHz): 80-core chip, 8x10 mesh network
Challenges in On-Chip Communication
High performance: Low communication latency is critical for high system performance.
Bandwidth efficiency: Well-designed routing algorithms provide high network throughput.
Power and area constraints: Simple topologies and slim routers reduce communication power consumption and save chip area.
Efficient multicast support: Cache coherence protocols rely heavily on multicast or broadcast communication.
We propose a bandwidth-efficient routing algorithm for multicast communication in NOCs with low latency and power consumption.
Prior Work in Multicast Communication
Routing Evaluation Criteria for Multicast Communication [Ni93]
  Multicast in multicomputer systems
Tree-based Multicast Routing for DSM Multiprocessor [Torrellas96]
  Short message multicast in DSM systems
Virtual Circuit Tree Multicasting for NOCs [Lipasti08]
  Demonstrates the necessity of multicasting on-chip
  Proposes table-based multicast routing
Region-based Multicast for CMPs [Duato08]
  Multicast routing for irregular topologies in CMPs
Outline
Motivation
Multicast Router Design
  State-of-art Unicast Router Architecture
  Replication Schemes
  Destination List Management
Recursive Partitioning Multicast (RPM)
  Network Partitioning
  Routing Rules
  Example
  Deadlock Avoidance
Evaluation
Conclusion
Different Bandwidth Usage Example
Left path requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.
Right path requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.
[Figure: two 4x4 meshes with nodes numbered 0 to 15, one per path; legend: Source, Destination]
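
The bandwidth difference between two ways of delivering the same multicast comes down to how many links the packet copies occupy. As a rough sketch only: the destination set below, the XY-routing helper `xy_path`, and both counting functions are hypothetical and are not taken from the slide's figure. It compares sending a separate unicast copy per destination with sharing links along a common tree in a 4x4 mesh.

```python
def xy_path(src, dst, width=4):
    """Return the directed links (a, b) on the XY route from src to dst.

    Nodes are numbered row by row, as in the 4x4 meshes on the slide.
    """
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    links = []
    while x != dx:  # travel along the row first
        nx = x + (1 if dx > x else -1)
        links.append((y * width + x, y * width + nx))
        x = nx
    while y != dy:  # then along the column
        ny = y + (1 if dy > y else -1)
        links.append((y * width + x, ny * width + x))
        y = ny
    return links


def unicast_link_traversals(src, dests, width=4):
    """Each destination gets its own copy, so shared links are paid for repeatedly."""
    return sum(len(xy_path(src, d, width)) for d in dests)


def tree_link_traversals(src, dests, width=4):
    """Copies share a link until their routes diverge, so each link counts once."""
    used = set()
    for d in dests:
        used.update(xy_path(src, d, width))
    return len(used)


# Hypothetical source and destinations, not the ones in the figure.
source, destinations = 9, [0, 3, 7, 14]
print(unicast_link_traversals(source, destinations))
print(tree_link_traversals(source, destinations))
```

For this hypothetical destination set the separate copies use 12 link traversals while the shared tree uses 8.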
State-of-Art Wormhole Unicast Router
[Figure: router block diagram with Inputs 0 to 4, input buffers holding VC 1 to VC n per input, Route Computation, VC Allocator, Switch Allocator, and a crossbar switch driving Outputs 0 to 4]
[Figure: pipeline timing across router and link for the stages RC, VA, SA, ST, and LT]
RC: Route Computation; VA: VC Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal
What Do We Need in a Multicast Router?
Packet Replication
  Synchronous Replication
  Asynchronous Replication
Destination List Management
  All-destination Encoding
  Bit String Encoding
  Multiple-region Broadcast Encoding
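
Of the destination list options, bit string encoding is the easiest to illustrate: one bit per node, set when that node is a destination. The sketch below assumes nodes are numbered from 0 and defaults to 64 nodes; the helper names and the use of a plain integer as the header field are hypothetical, not the layout used in the talk.

```python
def encode_destinations(dests, num_nodes=64):
    """Pack a list of destination node IDs into one integer, one bit per node."""
    bits = 0
    for node in dests:
        if not 0 <= node < num_nodes:
            raise ValueError(f"node {node} is outside the {num_nodes}-node network")
        bits |= 1 << node
    return bits


def decode_destinations(bits):
    """Return the sorted node IDs whose bits are set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def remove_destination(bits, node):
    """Clear one node's bit, for example after a copy has been ejected there."""
    return bits & ~(1 << node)


header = encode_destinations([0, 3, 5])
print(bin(header))
print(decode_destinations(remove_destination(header, 3)))
```

The appeal of this encoding is that the header size is fixed by the network size rather than by the number of destinations, and removing a destination is a single bit operation.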
Synchronous Replication
Packet replication happens at the Switch Traversal stage.
[Figure: crossbar with Inputs 0 to 3 and Outputs 0 to 3; a four-flit packet (H, M, M, T) is replicated over cycles 0 to 3. H: head flit; M: middle flit; T: tail flit]
Asynchronous Replication
[Figure: crossbar with Inputs 0 to 3 and Outputs 0 to 3; a four-flit packet (H, M, M, T) is replicated over cycles 0 to 3. H: head flit; M: middle flit; T: tail flit]
Network Partitioning
Three Parts (5, 6, 7)
Three Parts (0, 1, 7)
Three Parts (3, 4, 5)
Three Parts (1, 2, 3)
Eight Parts
[Figure: mesh partitioned into numbered parts around the source node, with directions N, S, E, W marked]
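
The partitioning step can be sketched as a classification of each destination by its position relative to the current node. The numbering used here is an assumption: parts run counterclockwise from the top-right region (0 = NE, 1 = N, 2 = NW, 3 = W, 4 = SW, 5 = S, 6 = SE, 7 = E), with x growing eastward and y growing northward. It was chosen because it reproduces the four three-part sets listed on the slide when the source sits at a corner. The function names are hypothetical.

```python
def part_of(cur, dst):
    """Return the part (0-7) that dst falls in relative to cur, or None if equal."""
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx == 0 and dy == 0:
        return None
    if dx > 0 and dy > 0:
        return 0  # NE region
    if dx == 0 and dy > 0:
        return 1  # straight north
    if dx < 0 and dy > 0:
        return 2  # NW region
    if dx < 0 and dy == 0:
        return 3  # straight west
    if dx < 0 and dy < 0:
        return 4  # SW region
    if dx == 0 and dy < 0:
        return 5  # straight south
    if dx > 0 and dy < 0:
        return 6  # SE region
    return 7      # straight east (dx > 0, dy == 0)


def partition(cur, dests):
    """Group destinations by part; parts with no destination are omitted."""
    parts = {}
    for d in dests:
        p = part_of(cur, d)
        if p is not None:
            parts.setdefault(p, []).append(d)
    return parts


mesh = [(x, y) for x in range(4) for y in range(4)]
print(sorted(partition((0, 0), mesh)))  # source at the bottom-left corner
print(sorted(partition((1, 1), mesh)))  # source inside the mesh
```

Under this numbering a corner source sees only three non-empty parts, while a source inside the mesh sees all eight.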
Basic Routing Rules
[Figure: source and destination nodes with regions marked NE and SW and directions N, S, E, W]
North: top-right corner. West: top-left corner. South: bottom-left corner. East: bottom-right corner.
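
One reading of the basic rules is that each output port serves one corner region: North takes destinations toward the top right, West the top left, South the bottom left, and East the bottom right. The sketch below follows that reading and additionally assumes that destinations lying straight along an axis leave through the port facing them. Coordinates (x eastward, y northward) and the function names are hypothetical.

```python
def output_port(cur, dst):
    """Pick the output port for one destination under the assumed basic rules."""
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]
    if dx == 0 and dy == 0:
        return "Local"  # the packet has arrived and is ejected
    if dy > 0 and dx >= 0:
        return "N"      # straight north or top-right corner
    if dx < 0 and dy >= 0:
        return "W"      # straight west or top-left corner
    if dy < 0 and dx <= 0:
        return "S"      # straight south or bottom-left corner
    return "E"          # straight east or bottom-right corner


def route_step(cur, dests):
    """Split a destination list into one sub-list per output port."""
    ports = {}
    for d in dests:
        ports.setdefault(output_port(cur, d), []).append(d)
    return ports


print(route_step((1, 1), [(2, 3), (0, 2), (0, 0), (3, 0), (1, 1)]))
```

Each hop chosen this way moves the copy one step closer to every destination in its sub-list, so the paths stay minimal.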
Optimized Routing Rules
[Figure: source and destination nodes, annotated "Deadlock!!!"]
RPM Example-step 1
[Figure: step 1 of the example. Legend: Source, Destination, M = Multicast Packet, Partitioning]
RPM Example-step 2
[Figure: step 2, with one ejection]
RPM Example-step 3
[Figure: step 3]
RPM Example-step 4
[Figure: step 4, with three ejections]
RPM Example-step 5
[Figure: step 5, with one ejection]
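
The step-by-step behavior in the example (re-partition at every hop, replicate when destinations need different ports, eject on arrival) can be imitated with a small simulation. This is a simplified sketch, not the example from the figures: it assumes each output port serves one corner region (North for top right, West for top left, South for bottom left, East for bottom right) plus the axis facing it, it ignores the optimized rules, virtual networks, and router timing, and the source and destinations are hypothetical.

```python
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def output_port(cur, dst):
    """Assumed basic rule: one port per corner region plus the axis facing it."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx == 0 and dy == 0:
        return "Local"
    if dy > 0 and dx >= 0:
        return "N"
    if dx < 0 and dy >= 0:
        return "W"
    if dy < 0 and dx <= 0:
        return "S"
    return "E"


def multicast(src, dests):
    """Replicate hop by hop until every copy is ejected.

    Returns (link_traversals, sorted list of nodes that received the packet).
    """
    packets = [(src, list(dests))]
    links, delivered = 0, []
    while packets:
        next_packets = []
        for node, targets in packets:
            # Re-partition the remaining destinations at the current node.
            groups = {}
            for d in targets:
                groups.setdefault(output_port(node, d), []).append(d)
            for port, group in groups.items():
                if port == "Local":
                    delivered.append(node)  # ejection at a destination
                else:
                    mx, my = MOVES[port]
                    next_packets.append(((node[0] + mx, node[1] + my), group))
                    links += 1              # one copy crosses one link
        packets = next_packets
    return links, sorted(delivered)


print(multicast((1, 1), [(3, 3), (1, 3), (0, 2), (2, 0)]))
```

For these hypothetical destinations the simulation uses 8 link traversals, against 10 if each destination received its own unicast copy along a minimal path.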
Deadlock Avoidance
RPM has no turn restrictions, potentially introducing deadlock.
We use Virtual Networks (VNs) to avoid deadlock.
  Two VNs lie in the same physical network.
  The virtual channels of each port are divided equally between the two virtual networks.
  The virtual network ID (0 or 1) of each packet is decided at the source.
[Figure: two 4x4 meshes with nodes numbered 0 to 15, labeled Virtual Network 0 and Virtual Network 1]
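
The equal division of virtual channels between the two virtual networks can be sketched as follows. The contiguous assignment (lower-numbered VCs to VN 0, higher-numbered VCs to VN 1) and the function name are assumptions; the slides only state that the VCs of each port are split equally. The default of 4 VCs per port matches the evaluation setup.

```python
def vcs_for_virtual_network(vn_id, vcs_per_port=4, num_vns=2):
    """Return the VC indices at a port that a packet in virtual network vn_id may use."""
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs cannot be divided equally among the virtual networks")
    if not 0 <= vn_id < num_vns:
        raise ValueError(f"virtual network ID must be in 0..{num_vns - 1}")
    share = vcs_per_port // num_vns
    # Assumed layout: VN 0 gets the first block of VCs, VN 1 the next block.
    return list(range(vn_id * share, (vn_id + 1) * share))


print(vcs_for_virtual_network(0))
print(vcs_for_virtual_network(1))
```

Because the VN ID is fixed at the source, a packet only ever competes for the VCs of its own virtual network, so the two networks share links without sharing buffers.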
Evaluation Methodology
Performance Model: Cycle-accurate network simulator
  Models all router pipeline stages in detail
  Highly parameterized
Power Model: Orion with both dynamic and leakage power models
Network configuration
  Topology: 8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
  Routing: RPM
  VC/Port: 4
  VC Depth: 4
  Packet Length (flits): 4
  Unicast Traffic Pattern: Uniform Random (Bit Complement, Transpose)
  Multicast Packet Portion: 10% (5%, 20%, 40%, 80%)
  Multicast Destination Number: 0-16 (uniformly distributed)
Uniform Random Traffic
Latency improves by around 50% before network saturation. Network throughput is extended by 40%.
[Figure: latency (cycles, 0 to 120) versus injection rate (flits/cycle/core, 0.01 to 0.15) for RPM, Mul unicast, VCTM(20%), VCTM(40%), and VCTM(80%); annotated 50%, 40%, 40%]
Link Utilization
[Figure: link utilization (op/cycle, 0 to 0.45) versus injection rate (flits/cycle/core, 0.01 to 0.45) for RPM, VCTM(20%), VCTM(40%), and VCTM(80%); annotated 33% and 45%]
Under low workload, RPM reduces link utilization by 33%. Under high workload, RPM reduces link utilization by 45%.
Dynamic Power Consumption
[Figure: stacked bars of dynamic power (W, 0 to 12) for RPM and VCTM at injection rates from 0.01 to 0.5 flits/cycle/core, broken down into Buffer, VC Arbiter, SW Arbiter, Xbar, and Link; annotated 40% and 50%]
Scalability Study-Network Size
[Figure: latency (cycles, 0 to 140) versus network size (6×6, 8×8, 10×10, 16×16) for RPM and VCTM; annotated "Over 50%"]
Scalability Study-Multicast Traffic Portion
[Figure: latency (cycles, 0 to 140) versus portion of multicast traffic (5%, 10%, 20%, 40%, 80%, 100%) for RPM and VCTM]
Scalability Study-Destination Number
[Figure: latency (cycles, 0 to 140) versus maximum number of destinations (4, 8, 16, 32) for RPM and VCTM]
Conclusion
We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM)
  Bandwidth-efficient and scalable
Performance improvement
  Up to 50% latency reduction
  33% link utilization reduction
Power savings
  Up to 40% total dynamic power savings
  25% crossbar and link power savings
Thank you!
Backup
Hardware Implementation of Routing logic
Bit Complement Traffic
[Figure: latency (cycles, 0 to 120) versus injection rate (flits/cycle/core, 0.01 to 0.15) for RPM, Mul unicast, VCTM (20%), VCTM (40%), and VCTM (80%)]
Transpose Traffic
[Figure: latency (cycles, 0 to 120) versus injection rate (flits/cycle/core, 0.01 to 0.15) for RPM, Mul unicast, VCTM (20%), VCTM (40%), and VCTM (80%)]