Quality of Service in Network on Chip
Isask’har (Zigi) Walter
Supervised by: Prof. Israel Cidon, Prof. Ran Ginosar and Dr. Avinoam Kolodny
February, 2008 NoC Seminar 2
Outline
- Network on Chip (NoC) and QNoC
- Capacity Allocation (joint work with Zvika Guz)
- Hot Modules in Wormhole NoCs
- Summary
System on Chip (SoC) Interconnect
Explosion in the number of modules in a single chip; networks are replacing system busses:
- Low area
- Low power
- Better scalability
- Higher parallelism
- Spatial reuse
- Unicast
QNoC Architecture
- Grid topology
- Packet-switched
- XY routing
- Service levels
- Wormhole hop-to-hop flow control
[Figure: grid of modules, each attached to a router; routers connected by links]
E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, “QoS Architecture and Design Process for Cost-Effective Network on Chip”, Journal of Systems Architecture, 2004
Wormhole Flow-Control
[Figure: a packet split into flits D0..D7; each flit carries Type and SL fields, and the header flit carries the destination]
- Flit-based communication
- SL: service level (0/1/2/3); Type: head/body/tail flit
- The destination appears only in the header flit
- Each flit must include a Type field
Wormhole Routing
[Figure: IP1 and IP2 connected through their network interfaces]
- Suits on-chip interconnect well
- Small number of buffers
- Low latency
- Virtual channels allow concurrent transmission of flits of different packets on the same link; the flits are locally labeled
Quality of Service in QNoC
- QoS is defined by throughput and latency requirements, e.g. interrupts, real-time traffic, block transfers: high bandwidth, low latency
- Implemented using separate buffers (service levels) and a static priority policy
- Requirements should be met at low cost: design parameters and run-time mechanisms
QNoC Design Flow
1. Define inter-module traffic
2. Place modules
3. Allocate link capacities
4. Verify QoS and cost
Too low a capacity results in poor QoS; too high a capacity wastes power and area.
Use Existing Algorithms?…
- Efficient algorithms exist for store-and-forward networks
- These algorithms are useless for wormhole networks, as they ignore inter-link dependencies
Our Approach
- An analytical model to forecast QoS
- A capacity allocation algorithm that exploits the model
Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, “Efficient Link Capacity and QoS Design for Wormhole Network-on-Chip”, Design, Automation and Test in Europe (DATE), 2006
Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, “Network Delays and Link Capacities in Application-Specific Wormhole NoCs”, VLSI Design, 2007
Delay Analysis - Goal
[Figure: two flows, s1→d1 and s2→d2, crossing the grid]
Replace extensive simulations with an analytical model that forecasts QoS.
Approximate per-flow latencies, given:
- Network topology
- Communication demands
- Link capacities
Delay Analysis – Prior work 1/4
Though many wormhole analyses exist, they don’t fit, because they assume:
- Symmetrical communication demands
- No virtual channels
- Identical link capacity!
Generally, they calculate the delay of an “average flow”; a per-flow analysis is needed.
Delay Analysis – Prior work 2/4
H. Sarbazi-Azad, A. Khonsari and M. Ould-Khaoua, “Performance Analysis of Deterministic Routing in Wormhole K-ary n-cubes with Virtual-Channels”, Journal of Interconnection Networks, 2002
Delay Analysis – Prior work 3/4
Approximate the delay of an “average flow”
Delay Analysis – Prior work 4/4
S. Loucif and M. Ould-Khaoua, “Modeling Latency in Deterministic Wormhole-Routed Hypercubes under Hot-Spot Traffic”, The Journal of Supercomputing, 2004
Wormhole Delay Analysis
Inputs: network topology, communication demands, link capacities.
Output: per-flow latencies.
Delay Analysis - Basics
- Focus on long packets
- Packet transmission can be divided into two separate phases: path acquisition and flit transmission
- For simplicity, we assume “enough” VCs on every link, so path acquisition time is negligible
Main Observation
[Figure: a packet streaming from IP1's interface to IP2's interface]
The delivery resembles a pipeline pass.
Packet Delivery Time
[Figure: the same path with one low-capacity link along it]
The delivery time of long packets is dominated by the slowest link:
- Transmission rate
- Link sharing
IP1
Inte
rfac
e
Interface Interface
IP2
Packet Delivery Time
The delivery time of long packets is dominated by the slowest link- Transmission
rate- Link sharing
IP3
February, 2008 NoC Seminar 22
Analysis Basics
For each link on a flow's path, determine the flow’s effective bandwidth, accounting for the interleaving of other flows’ flits.
[Figure: flit transmission timelines with and without interleaving]
Single Hop Flow, no Sharing

    t_ij = l / C_j

- t_ij: mean time to deliver a flit of flow i over link j [sec]
- C_j: capacity of link j [bits per sec]
- l: flit length [bits/flit]
The Effect of Sharing
Use heuristics to model the “flit interleaving delay” of each link on the flow's path.
Single Hop Flow, with Sharing

    t_ij = l / (C_j - l·λ_ij)

- t_ij: mean time to deliver a flit of flow i over link j [sec]
- C_j: capacity of link j [bits per second]
- l: flit length [bits/flit]
- λ_ij: total flit injection rate of all flows sharing link j, except for flow i [flits/sec]

The term l·λ_ij is the bandwidth used by the other flows on link j, so the denominator is flow i's effective bandwidth on the link.
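The effective-bandwidth relation can be sketched as a small helper; the function name and the sample numbers below are illustrative, not from the talk:

```python
def flit_delay(C_j, l, lam_ij):
    """Mean time [sec] to deliver one flit of flow i over link j.

    C_j    : capacity of link j [bits/sec]
    l      : flit length [bits/flit]
    lam_ij : total flit injection rate of all OTHER flows sharing
             link j [flits/sec]; lam_ij = 0 gives the no-sharing case
    """
    eff_bw = C_j - l * lam_ij  # flow i's effective bandwidth [bits/sec]
    assert eff_bw > 0, "link j is oversubscribed"
    return l / eff_bw

# 32-bit flits on a 1 Gbit/sec link:
t0 = flit_delay(C_j=1e9, l=32, lam_ij=0.0)   # no sharing: l / C_j
t = flit_delay(C_j=1e9, l=32, lam_ij=10e6)   # competitors use 0.32 Gbit/sec
```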
The Convoy Effect
Consider inter-link dependencies: wormhole backpressure creates traffic jams down the road.
The per-link delay is therefore augmented with the basic delays of all subsequent hops on the flow's path, each weighted by its distance from the link:

    t̂_ij = t_ij + Σ_{k after j on flow i's path} t_ik / dist(j, k)

- t_ij: the per-link delay under link load, as above
- the sum accounts for all subsequent hops, with each basic delay weighted by distance
Total Packet Transmission Time
The slowest (weakest) link dominates the transmission time:

    T_i = m_i · max{ t_ij | j on flow i's path }

- m_i: packet size [flits/packet]
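The total delivery time of a long packet follows directly from this formula; a minimal sketch, with illustrative numbers:

```python
def packet_delivery_time(m_i, flit_delays):
    """T_i = m_i * max_j t_ij.

    m_i         : packet size [flits/packet]
    flit_delays : per-link mean flit delays t_ij [sec] along flow i's path
    With enough VCs the worm streams through its path like a pipeline,
    so the slowest link sets the pace for all m_i flits.
    """
    return m_i * max(flit_delays)

# A 200-flit packet crossing three links; the middle link is slowest:
T = packet_delivery_time(200, [1e-8, 5e-8, 2e-8])
```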
Analysis Validation
The analytical model was validated using simulations with different link capacities and different communication demands.
[Plot: analysis and simulation results vs. normalized load (utilization)]
Capacity Allocation Problem
Use the delay analysis to solve an optimization problem.
Given:
- System topology and routing
- Each flow’s bandwidth (f_i) and delay bound (T_i^REQ)
Minimize the total link capacity:

    minimize Σ_{e ∈ E} C_e
    such that for every flow i: T_i ≤ T_i^REQ
Capacity Allocation Algorithm
A greedy, iterative algorithm:
1. For each src-dst pair, use the delay model to identify the most sensitive link
2. Increase its capacity
3. Repeat until the delay requirements are met
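A minimal sketch of this greedy loop, assuming a pluggable delay model; the toy model, the step size, and the "most sensitive link" test (largest delay reduction per capacity increment) are illustrative assumptions, not the paper's exact procedure:

```python
def allocate_capacities(flows, links, delay_model, step):
    """Greedy, iterative capacity allocation (illustrative sketch).

    flows       : {flow_id: (path, T_req)}, path = list of link ids
    links       : {link_id: capacity [bits/sec]}, mutated in place
    delay_model : (path, links) -> predicted per-flow latency [sec]
    step        : capacity increment per iteration [bits/sec]
    """
    while True:
        violated = [(path, T_req) for path, T_req in flows.values()
                    if delay_model(path, links) > T_req]
        if not violated:
            return links
        path, _ = violated[0]
        # "Most sensitive" link: the one whose capacity increase
        # reduces the flow's predicted delay the most.
        def gain(e):
            trial = dict(links)
            trial[e] += step
            return delay_model(path, links) - delay_model(path, trial)
        best = max(path, key=gain)
        links[best] += step

# Toy delay model: m-flit packet, slowest link dominates, t_ij = l / C_j.
def toy_delay(path, links, m=100, l=32):
    return m * max(l / links[e] for e in path)

links = {"e1": 1e9, "e2": 1e9}
flows = {"f1": (["e1", "e2"], 2.5e-6)}  # bound below the initial 3.2e-6
allocate_capacities(flows, links, toy_delay, step=1e8)
```

Both links end up at 1.3 Gbit/sec, the smallest multiple of the step that meets the 2.5 µs bound in this toy setup.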
Capacity Allocation – Example #1
A simple 4-by-4 system with a uniform traffic pattern and uniform requirements:
- “Classic” design: 74.4 Gbit/sec
- Using the delay model and algorithm: 69 Gbit/sec
- Total capacity reduced by 7%
[Figure: link capacities before and after optimization]
DVD Decoder - Results
A SoC-like system with specific traffic demands and delay requirements:
- “Classic” design: 41.8 Gbit/sec
- Using the algorithm: 28.7 Gbit/sec
- Total capacity reduced by 30%
[Figure: link capacities before and after optimization]
Example#3 - VOPD Application
Video Object Plane Decoder:
- “Classic” design: 640 Gbit/sec
- Using the algorithm: 369 Gbit/sec
- Total capacity reduced by 40%
Summary
Capacity allocation:
- A simple analytical model capturing multiple VCs, different link capacities, and different communication demands
- An allocation algorithm that reduces network cost
Future Work
Extensions:
- Finite number of VCs
- Analytical delay modeling
- Allocation algorithm
New applications:
- Core placement
- Topology selection
- Routing
Outline
- NoC and QNoC
- Capacity Allocation (joint work with Zvika Guz)
- Hot Modules in QNoC
- Summary
Hot-Modules
- The NoC is designed and dimensioned to meet QoS requirements: buffer sizing, routing, router arbitration, link capacities, …
- NoC designers cannot tune everything: modules typically have limited capacity
- Highly demanded, bandwidth-limited modules create edge bottlenecks; in a SoC they are often known in advance (off-chip DRAM, an on-chip special-purpose processor)
- System performance is strongly affected, even if the NoC has infinite bandwidth
Hot Module (HM) in NoC
In a wormhole, best-effort NoC, multiple worms “get stuck” in the network at high hot-module utilization.
Two problems arise:
1. System performance
2. Source fairness
[Figure: worms backed up toward the hot module's interface]
Problem #1: the Hot Module Affects the System
[Figure: traffic between IP2 and IP3 is blocked by worms stuck on their way to the hot module IP1]
The HM is not a local problem: traffic not destined for the HM suffers too!
Problem #2: the Source Fairness Problem
Multiple locally fair decisions do not add up to global fairness.
[Figure: flows from several sources converge on the HM's interface]
The limited, expensive HM resource isn’t fairly shared.
Saturation (Un)Fairness
A saturated router divides the available BW equally between its inputs.
[Figure: a saturated grid with the HM in a corner; the per-source share of the HM bandwidth shrinks at every saturated router along the path, from BW/2 for the nearest source down to BW/216 for the farthest]
The farthest source gets less than 1% of the HM BW!
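The arithmetic behind this figure can be sketched as follows; the individual split factors along the path are illustrative assumptions (only their product, 216, comes from the slide):

```python
def saturated_share(split_factors):
    """Fraction of the HM bandwidth a source receives when every
    saturated router on its path divides bandwidth equally among
    its competing inputs; split_factors lists the number of ways
    the bandwidth is divided at each such router."""
    share = 1.0
    for n in split_factors:
        share /= n
    return share

near = saturated_share([2])              # nearest source: BW/2
far = saturated_share([2, 3, 4, 3, 3])   # farthest source: BW/216
```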
Related Work
Hotspot solutions were comprehensively studied over the last two decades (e.g. Pfister and Norton 1985; Duato et al. 2005).
Classically, solutions are categorized by the mechanism's policy:
- Avoidance-based (frequently impossible)
- Detection-based (requires threshold tuning)
- Prevention-based (overhead during light load)
And by the mechanism's implementation:
- Central arbitration
- Router-based (seems to draw the most attention)
- End-to-end flow control
Router-Based Solutions
[Figure: a router with input buffers, a crossbar, and output buffers]
Solving hotspots in the routers:
- Virtual circuits
- Fair queuing
- Dedicated queues
- Deflective routing
- Packet combining
- Packet dropping
- Backpressure (credit/rate based)
- and more…
Routers can(?) detect congested periods; this is easier in store-and-forward networks.
Router-Based Solutions
QNoC routers, however, are simple: fast, power- and area-efficient:
- A few buffers
- Efficient routing
- A simple arbitration policy
- No state/flow memory
Related Work
A few end-to-end solutions do exist, but they are stop-and-wait based, do not prevent the hotspot effects, and do not address the fairness problem.
Examples:
- M. Thottethodi, A. R. Lebeck and S. Mukherjee, “Self-Tuned Congestion Control for Multiprocessor Networks”, HPCA 2000
- J. Duato, I. Johnson, J. Flich, F. Naven, P. Garcia and T. Nachiondo, “A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks”, HPCA 2005
Our Approach
- The problem is not caused by the NoC, but rather by a congested end-point
- The solution should address the root cause, not the symptoms
- Utilize the existing NoC infrastructure
- Solve both problems: simple and efficient
Hot Module Congestion
During congested periods, sources should not inject packets toward the HM:
- They will experience an increased delay anyway
- Better to wait at the source, not in the network
Keep the routers unmodified!
HM Allocation Control Basics
[Figure sequence: an Allocation Controller sits at the hot module IP2's interface; before injecting data toward the HM, a source such as IP1 exchanges control messages with the controller over the NoC]
HM Control Packets
The HM Controller receives all requests and can employ any scheduling policy.
Control packets are sent using a high service level, bypassing (blocked) data packets!
- Credit request packet fields: Source, Dest., Req. Credit
- Credit reply packet fields: Source, Dest., Credit
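The two control packet formats can be written down as plain records; the field names follow the slide, while the Python types are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CreditRequest:
    source: int      # requesting module
    dest: int        # the hot module
    req_credit: int  # amount of credit requested (e.g. in flits)

@dataclass
class CreditReply:
    source: int
    dest: int
    credit: int      # amount of credit granted

req = CreditRequest(source=3, dest=12, req_credit=200)
rep = CreditReply(source=12, dest=3, credit=200)
```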
QNoC Router
[Figure: router block diagram with five input ports and five output ports around a crossbar; each port has buffers with routing (RT), scheduling/control, RD/WR, BLOCK, SIGNAL and CREDIT lines]
Enhanced Request Packet
The request may include additional data as needed: the payload’s priority, deadline, expiration time, etc.
- Credit request packet fields: Source, Dest., Req. Credit, plus optional fields (Priority, Deadline, Expiration, …)
HM Allocation Controller
[Figure: a Requests Decoder feeds a Pending-Requests Table (SRC, Size, Priority, Deadline, Expiration, …) and a Local Arbiter, whose decisions leave through a Reply Encoder; the table's extra fields are optional]
The HM Allocation Controller is customized according to the system’s requirements.
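One possible local arbiter is sketched below; it grants pending requests in arrival order against the HM's free capacity (a simple stand-in for a full scheduling policy), and all names and the credit accounting are assumptions:

```python
from collections import deque

class HMAllocationController:
    """Sketch of the allocation controller's local arbiter."""

    def __init__(self, hm_credits):
        self.free = hm_credits   # credit the HM can currently serve
        self.pending = deque()   # pending-requests table (FIFO)

    def request(self, source, credits):
        """A decoded credit-request packet enters the table."""
        self.pending.append((source, credits))

    def grant(self):
        """Encode credit replies while free capacity remains."""
        replies = []
        while self.pending and self.pending[0][1] <= self.free:
            source, credits = self.pending.popleft()
            self.free -= credits
            replies.append((source, credits))
        return replies

    def release(self, credits):
        """The HM finished serving a packet; its credit returns."""
        self.free += credits

ctrl = HMAllocationController(hm_credits=300)
ctrl.request(source=1, credits=200)
ctrl.request(source=2, credits=200)
first = ctrl.grant()   # only source 1 fits the free capacity
ctrl.release(200)      # source 1's packet has been served
second = ctrl.grant()  # now source 2 is granted
```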
Further Enhancements
- Short packets are not negotiated
- A source’s quota is slowly self-refreshing
- The mechanism is turned off when the network is not congested
- Crediting modules ahead of time hides the request-grant latency during light-load periods
Not Classic Flow-Control
- Flow control protects the destination’s buffer: a pair-wise protocol
- HM access regulation protects the system: a many-to-one protocol
Results – Synthetic Scenario
Hotspot traffic: all-to-one traffic with all-to-all background traffic.
- High network capacity
- Limited hot-module bandwidth
- HM controller arbitration: round-robin
[Figure: 4-by-4 grid with the HM in one corner]
System Performance
[Plot: average packet latency vs. load, with and without regulation; annotated improvements of x10 and x30]
Hot vs. non-Hot Module Traffic
[Plot: average packet latency of HM traffic and background traffic, with and without regulation]
Using regulation, the latency of non-HM traffic is drastically reduced (x40).
Source Fairness
[Plot: latency of source #5 (near the HM) and source #16 (far from it), with and without regulation, in a 4-by-4 grid of sources numbered 1-16]
Fairness in Saturated Network
- Hot-module utilization: 99.99%; regulated hot-module utilization: 98.32%
- Simulation results for a 4-by-4 system; data packet length: 200 flits, control packet length: 2 flits
[Chart: per-source bandwidth with and without allocation control]
MPEG-4 Decoder
A real SoC with an over-provisioned NoC and two hot modules: SDRAM (25% of all traffic) and SRAM2 (22% of all traffic).
[Figure: MPEG-4 decoder floorplan with VU, AU, MED, CPU, RAST, SDRAM, SRAM1, SRAM2, IDCT, ADSP, UP SAMP, BAB and RISC modules]
Results – MPEG-4 Decoder
[Plots: all traffic, showing a x2 latency reduction at 80% load, and the HM/non-HM traffic breakdown, showing a x8 reduction at 80% load]
The HMs Are Better Utilized
Without regulation, the hot modules are only 60% utilized: traffic to one HM blocks the traffic to the other!
[Chart: per-flow bandwidth for the flows destined at HM1 and at HM2, with and without allocation control; without control there are significant differences in BW between the flows]
Future Work
- Dynamically set hot modules
- Other scheduling policies at the hot-module controller
- Single/multiple control modules for multiple HMs
- Effect of placement
Summary
- Hot modules are common in real SoCs
- Hot modules ruin system performance and are not fairly shared, even in NoCs with infinite capacity; the network intensifies the problem, but can also provide tools for resolving it
- A simple mechanism achieves a dramatic improvement, completely eliminating the HM effects
Hot-Modules, Cool NoCs!
Thank you!
Questions?
Hot-Modules, Cool NoCs!
QNoC Research Group