January 30, 2008 Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs258
Dynamic Networks CS 213, LECTURE 15 L.N. Bhuyan. 10/3/2015 CS258 S99 2 What is Dynamic Network...
Dynamic Networks
CS 213, Lecture 15, L.N. Bhuyan
04/21/23 CS258 S99 2
What is a Dynamic Network?
• A dynamic network can connect any input to any output by enabling or disabling switches in the network
• Examples:
- Shared bus: the bus arbiter connects a processor to a memory
- Crossbar: consists of many switching elements, which can be enabled to connect many inputs to many outputs simultaneously
- Multistage network: consists of several stages of switches that are enabled to establish connections
- The nodes in static networks (like a mesh) also contain dynamic crossbars
Crossbar Switch Design
• Complexity O(N**2) for an N x N crossbar
[Figure: crossbar switch block diagram: input ports with receivers and input buffers, the crossbar, output buffers with transmitters and output ports, and control logic for routing and scheduling]
How do you build a crossbar
[Figure: 4x4 crossbar with inputs I0-I3 and outputs O0-O3; the multiplexer select lines come from control]
N**2 switches => Cost O(N**2)
Time taken by the arbiter = O(N**2)
Multiplexers are controlled from the arbiter/controller/scheduler
Crossbar Contd.
• An N x N crossbar allows all N inputs to be connected simultaneously to all N outputs
• It allows all one-to-one mappings, called permutations. Number of permutations = N!
• When two or more inputs request the same output, it is called a conflict. Only one of them is connected and the others are either dropped or buffered
• When processors access memories through a crossbar, this situation is called a memory access conflict
• Let p be the probability that a processor issues a request in a cycle, and assume each request is directed uniformly at random to one of the N memories. The average number of connections established per cycle, called the bandwidth (BW), is
BW = N{1 - (1 - p/N)**N} (Derive this!)
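One way to see where the expression comes from: a given memory is not requested by a given processor with probability 1 - p/N, so it is requested by nobody with probability (1 - p/N)**N and by at least one processor with probability 1 - (1 - p/N)**N; summing over the N memories gives BW. The sketch below evaluates the formula and cross-checks it with a small Monte Carlo run. The function names, cycle count and seed are illustrative choices, not part of the lecture.

```python
import random


def crossbar_bw(n: int, p: float) -> float:
    """Closed form from the slide: BW = N * (1 - (1 - p/N)**N)."""
    return n * (1.0 - (1.0 - p / n) ** n)


def simulate_bw(n: int, p: float, cycles: int = 20000, seed: int = 0) -> float:
    """Average number of distinct memories requested per cycle."""
    rng = random.Random(seed)
    served = 0
    for _ in range(cycles):
        requested = set()
        for _proc in range(n):
            if rng.random() < p:                  # processor issues a request
                requested.add(rng.randrange(n))   # uniformly chosen memory
        served += len(requested)                  # one winner per requested memory
    return served / cycles


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, round(crossbar_bw(n, 1.0), 3), round(simulate_bw(n, 1.0), 3))
```

For large N and p = 1 the ratio BW/N approaches 1 - 1/e, about 0.63, so roughly a third of the memories sit idle each cycle under this traffic model.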
Input-Buffered Switch
• Independent routing logic per input
• Scheduler logic arbitrates each output: priority, FIFO, random
• Head-of-line blocking problem
– The head packet in a buffer cannot depart because the output is busy with another packet. The second packet may be destined to an output that is free, but cannot depart because it is blocked by the first packet. One solution is to create multiple input queues, one per output, called Virtual Output Queuing, which is adopted in most routers.
• Scheduler design: ensuring the maximum number of simultaneous connections is a challenging research area.
[Figure: input-buffered switch: input ports with routing logic R0-R3 feed the crossbar and the output ports, under control of the scheduler]
Problems with Input-Buffered Switch
• FIFO input buffers give rise to the head-of-line (HOL) blocking problem
• Current routers employ a separate input queue for each output, called a virtual output queue (VOQ)
• How, then, should packets from the different VOQs be scheduled for transmission?
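A minimal sketch of the HOL effect, assuming a toy switch where each input sends at most one packet per cycle, each output accepts at most one, and ties go to the lower-numbered input. The function names and the greedy VOQ scheduler are illustrative assumptions, not a router's actual algorithm.

```python
from collections import deque


def drain_fifo(queues):
    """Cycles needed to drain FIFO input queues; only head packets may depart."""
    queues = [deque(q) for q in queues]
    cycles = 0
    while any(queues):
        busy = set()                       # outputs already taken this cycle
        for q in queues:                   # fixed priority: lower input first
            if q and q[0] not in busy:
                busy.add(q.popleft())
        cycles += 1
    return cycles


def drain_voq(queues):
    """Cycles needed when every input keeps one queue per output (VOQ)."""
    voqs = []
    for q in queues:
        per_output = {}
        for out in q:
            per_output.setdefault(out, deque()).append(out)
        voqs.append(per_output)
    cycles = 0
    while any(any(v.values()) for v in voqs):
        busy = set()
        for v in voqs:
            for out in sorted(v):          # greedy: first free output with traffic
                if v[out] and out not in busy:
                    v[out].popleft()
                    busy.add(out)
                    break                  # one packet per input per cycle
        cycles += 1
    return cycles


# Each entry is one input's packet list, given as destination outputs in order.
traffic = [[0, 1], [0, 2]]
print(drain_fifo(traffic), drain_voq(traffic))   # 3 2
```

With FIFO queues, input 1 waits behind its head packet for output 0 even though output 2 is idle; with VOQs its packet for output 2 leaves in the first cycle.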
VOQ-based Input Buffered Switch
Scheduling in Input Buffered Switch
• n independent arbitration problems?
– static priority, random, round-robin
• simplifications due to routing algorithm?
• general case is max bipartite matching
– Iterative algorithms: iSLIP in Cisco
[Figure: input buffers with routing logic R0-R3 feeding a crossbar to output ports O0, O1, O2]
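The "max bipartite matching" view can be sketched as follows: inputs are one side of the graph, outputs the other, and an edge means the input has a packet queued for that output. The code uses the textbook augmenting-path method, which is named here plainly because it is not iSLIP and not what any particular router implements; `requests` and `max_matching` are hypothetical names.

```python
def max_matching(requests):
    """requests[i] lists the outputs input i has packets for.

    Returns a dict mapping each matched output to the input it serves.
    """
    match = {}                                   # output -> input

    def try_assign(i, seen):
        for out in requests[i]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or re-route the input currently holding it.
            if out not in match or try_assign(match[out], seen):
                match[out] = i
                return True
        return False

    for i in range(len(requests)):
        try_assign(i, set())
    return match


requests = [[0, 1], [0], [1, 2]]
print(max_matching(requests))    # {0: 1, 1: 0, 2: 2}
```

A greedy pass that gives output 0 to input 0 would leave input 1 idle; the augmenting step moves input 0 to output 1 so that all three inputs are connected.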
Output Buffered Switch
• How would you build a shared pool?
[Figure: output-buffered switch: input ports with routing logic R0-R3, a buffer at each output port, and control]
Output scheduling
• n independent arbitration problems?
– static priority, random, round-robin
• simplifications due to routing algorithm?
• general case is max bipartite matching
[Figure: input buffers with routing logic R0-R3 feeding a crossbar to output ports O0, O1, O2]
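When each output is arbitrated independently, one small arbiter per output is enough. Below is a sketch of the round-robin option, under the assumption that the priority pointer moves to the input just after the most recent winner; the class and method names are made up for illustration.

```python
class RoundRobinArbiter:
    """One arbiter instance per output port."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        self.next = 0                  # input with highest priority this cycle

    def grant(self, requests):
        """requests: iterable of requesting input indices. Returns winner or None."""
        req = set(requests)
        for k in range(self.n):
            i = (self.next + k) % self.n
            if i in req:
                self.next = (i + 1) % self.n   # winner gets lowest priority next
                return i
        return None


arb = RoundRobinArbiter(4)
print([arb.grant({0, 2, 3}) for _ in range(4)])   # [0, 2, 3, 0]
```

Static priority corresponds to never advancing `next`, which is simpler but can starve high-numbered inputs.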
Multistage Interconnection Network (MIN)
A crossbar switch is not scalable. How about a network consisting of multiple stages of small crossbar switches? It has the following properties.
• N x N network for N = 2**n
• Consists of log2 N stages of 2x2 switches
• Has N/2 2x2 switches per stage
• Cost O(N log N) instead of O(N**2) for a crossbar
• For N = a**n, a MIN can be similarly designed with a x a switches
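The cost figures above can be checked with a few lines of arithmetic. This sketch counts stages and switches per stage for a MIN built from a x a switches and compares against the N**2 crosspoints of a full crossbar; the function names are illustrative.

```python
def crossbar_crosspoints(n):
    """An N x N crossbar needs N**2 crosspoints."""
    return n * n


def min_stage_counts(n, a=2):
    """Return (stages, switches_per_stage) for an N x N MIN of a x a switches."""
    stages, size = 0, 1
    while size < n:
        size *= a
        stages += 1
    if size != n:
        raise ValueError("N must be a power of the switch size a")
    return stages, n // a


for n in (8, 64, 1024):
    stages, per_stage = min_stage_counts(n)
    # Each 2x2 switch is itself a tiny crossbar with 4 crosspoints.
    print(n, crossbar_crosspoints(n), stages * per_stage * 4)
```

For N = 1024 this gives 10 stages of 512 switches, 20480 crosspoints in total, against 1048576 for the single crossbar.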
Multistage interconnection networks
[Figure: 8x8 Omega network with inputs 0-7 and outputs 000-111; the labels 1, 1, 0 appear in the drawing]
Complexity: the Omega network has complexity O(N log2 N).
Self routing: The source node generates a tag, which is the binary equivalent of the destination. At each switch, the corresponding tag bit is checked. If the bit is 0, the input is connected to the upper output. If it is 1, the input is connected to the lower output. If both inputs of a switch carry the same bit (both 0 or both 1), it is a switch conflict. One of them is connected. The other one is rejected or buffered at the switch (if it has a buffer => buffered crossbar).
What is Shuffle?
[Figure: (a) Perfect shuffle (b) Inverse perfect shuffle, each drawn on the eight addresses 000-111 (0-7)]
Shuffle interconnection:
$S(a_{n-1} a_{n-2} \ldots a_1 a_0) = (a_{n-2} a_{n-3} \ldots a_0 a_{n-1})$
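In other words, the perfect shuffle rotates the n-bit address left by one position, and the inverse shuffle rotates it right. A short sketch (function names are illustrative):

```python
def shuffle(x, n):
    """Perfect shuffle: rotate the n-bit address x left by one bit."""
    mask = (1 << n) - 1
    return ((x << 1) | (x >> (n - 1))) & mask


def inverse_shuffle(x, n):
    """Inverse perfect shuffle: rotate the n-bit address x right by one bit."""
    return (x >> 1) | ((x & 1) << (n - 1))


print([shuffle(i, 3) for i in range(8)])           # [0, 2, 4, 6, 1, 3, 5, 7]
print([inverse_shuffle(i, 3) for i in range(8)])   # [0, 4, 1, 5, 2, 6, 3, 7]
```

Addresses 000 and 111 map to themselves, which is why the top and bottom wires in the drawing go straight across.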
Omega Network
• Every stage of switches is preceded by a perfect shuffle interconnection
$S(a_{n-1} a_{n-2} \ldots a_1 a_0) = (a_{n-2} a_{n-3} \ldots a_0 a_{n-1})$
• An input can be connected to a straight or exchange output in a 2x2 switch.
$E(a_{n-1} a_{n-2} \ldots a_1 a_0) = (a_{n-1} a_{n-2} \ldots a_1 \bar{a}_0)$
• To route a message/packet in an Omega network, the destination tag, which is the binary equivalent of the destination, is used: $(d_{n-1} d_{n-2} \ldots d_1 d_0)$. The ith bit $d_i$ controls the routing at the ith stage counted from the right, with 0 <= i <= n-1. If $d_i = 0$, the input is connected to the upper output. If $d_i = 1$, it is connected to the lower output.
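The routing rule can be traced in code: at every stage the position is shuffled, then its last bit is replaced by the next tag bit, most significant bit first. The sketch also flags a conflict when two messages land on the same switch output in the same stage. `omega_route` and `has_conflict` are hypothetical helper names, and buffering is not modeled.

```python
def omega_route(src, dst, n):
    """Positions of a message after each of the n stages of an Omega network."""
    mask = (1 << n) - 1
    pos, path = src, []
    for stage in range(n):
        pos = ((pos << 1) | (pos >> (n - 1))) & mask   # perfect shuffle
        bit = (dst >> (n - 1 - stage)) & 1             # tag bit, MSB first
        pos = (pos & ~1) | bit                         # 0 = upper, 1 = lower output
        path.append(pos)
    return path


def has_conflict(pairs, n):
    """True if two (src, dst) messages need the same switch output in a stage."""
    paths = [omega_route(s, d, n) for s, d in pairs]
    return any(len({p[k] for p in paths}) < len(paths) for k in range(n))


print(omega_route(0b101, 0b010, 3))            # [2, 5, 2]
print(has_conflict([(0, 0), (4, 1)], 3))       # True
```

Sources 0 and 4 share the first-stage switch and both ask for its upper output, so one of the two must be dropped or buffered even though their destinations differ.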
Self Routing
• A processor generates a tag that is the binary equivalent of the destination
• The MSB controls the leftmost stage and the LSB controls the rightmost stage of the Omega network. A small controller inside the 2 x 2 switch senses this bit and enables the connection
• If bit $c_i = 0$, the request is to the upper output; if it is 1, the request is to the lower output.
• Routing is based on a digit rather than a bit if the switch size is greater than 2
• Network conflict: select the winner round-robin
• Less bandwidth than a crossbar, but more cost effective
• What about QoS? Future research
Theorem: The Omega network is self routing
Let the source be $(s_{n-1} s_{n-2} \ldots s_1 s_0)$ and the destination be $(d_{n-1} d_{n-2} \ldots d_1 d_0)$. Before Stage 1, the source is switched to the position $(s_{n-2} s_{n-3} \ldots s_0 s_{n-1})$ due to the perfect shuffle connection. After Stage 1 it is switched to $(s_{n-2} s_{n-3} \ldots s_0 d_{n-1})$ as per the (n-1)th bit of the destination.
Before the 2nd stage of switches, the source is connected to $(s_{n-3} \ldots s_0 d_{n-1} s_{n-2})$, and after the 2nd stage it becomes $(s_{n-3} \ldots s_0 d_{n-1} d_{n-2})$.
If we continue like this for n stages, the position becomes $(d_{n-1} d_{n-2} \ldots d_1 d_0)$, which is the destination.
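A worked instance of the argument for n = 3 may help. The source 010 and destination 110 are arbitrary values chosen for illustration; each stage applies the shuffle and then overwrites the last bit with the next destination bit.

```latex
\begin{align*}
\text{source} &= 010, \quad \text{destination } d_2 d_1 d_0 = 110 \\
\text{Stage 1:}\quad & 010 \xrightarrow{\text{shuffle}} 100 \xrightarrow{d_2 = 1} 101 \\
\text{Stage 2:}\quad & 101 \xrightarrow{\text{shuffle}} 011 \xrightarrow{d_1 = 1} 011 \\
\text{Stage 3:}\quad & 011 \xrightarrow{\text{shuffle}} 110 \xrightarrow{d_0 = 0} 110
\end{align*}
```

After three stages every source bit has been rotated out and replaced by a destination bit, so the final position is 110 regardless of where the message started.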
Switch Size a x a
Let N = a**n
• The MIN will consist of n stages of a x a crossbar switches with N/a switches per stage.
• The routing will be based on a radix-a digit i, with 0 <= i <= a-1
• Interconnection based on a-shuffle
Homework: Prove self routing based on radix a. Draw a 16x16 MIN based on 4x4 switches and explain its operation.
Derive the BW of an Omega network with N = a**n with the same input parameters as for the crossbar (Slide 5, "Crossbar Contd.")
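As a starting point for the homework (an exhaustive check, not a proof), the sketch below assumes the a-shuffle rotates the base-a digits of the address left by one, and that each stage replaces the last digit with the next destination digit, most significant first. `route_radix` is a hypothetical name.

```python
def route_radix(src, dst, a, n):
    """Final position of a message in an N = a**n MIN of a x a switches."""
    size = a ** n
    pos = src
    for stage in range(n):
        # a-shuffle: rotate the n base-a digits of pos left by one digit
        pos = (pos * a) % size + (pos * a) // size
        digit = (dst // a ** (n - 1 - stage)) % a   # most significant digit first
        pos = pos - (pos % a) + digit               # leave the switch on output `digit`
    return pos


# 16x16 network built from 4x4 switches: 2 stages of 4 switches each.
ok = all(route_radix(s, d, 4, 2) == d for s in range(16) for d in range(16))
print(ok)   # True
```

With a = 2 the same function reduces to the bit-wise Omega routing described earlier.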
Example: SP
• 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
• packet switched, cut-through, no virtual channels, source-based routing
• variable packet <= 255 bytes, 31 byte fifo per input, 7 bytes per output, 16 phit links
[Figure: switch board of a 16-node rack with intra-rack host ports P0-P15 and inter-rack external switch ports E0-E15; multi-rack configuration]
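The link figures above are mutually consistent, as the arithmetic below shows. It assumes one phit crosses a link per clock cycle and takes 1 MB as 10^6 bytes; neither assumption is stated on the slide.

```latex
\begin{align*}
\text{link bandwidth} &= 8\ \tfrac{\text{bit}}{\text{phit}} \times 40 \times 10^{6}\ \tfrac{\text{phit}}{\text{s}}
  = 320\ \text{Mbit/s} = 40\ \text{MB/s} \\
\text{flit time} &= \frac{16\ \text{bit}}{8\ \text{bit/cycle}} = 2\ \text{cycles} = 50\ \text{ns} \\
\text{max packet time} &= \frac{255\ \text{B}}{40\ \text{MB/s}} \approx 6.4\ \mu\text{s}
\end{align*}
```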
Example: IBM SP Vulcan switch
• Many gigabit Ethernet switches use a similar design without the cut-through
[Figure: Vulcan switch chip block diagram. Each input port has a deserializer, FIFO, CRC check, route control and flow control; each output port has a FIFO, CRC generation, flow control, a serializer and a crossbar arbiter (XBarArb). The ports connect through an 8 x 8 crossbar and a central queue (RAM 64x128) with input and output arbiters (InArb, OutArb). Datapath widths of 8 and 64 are marked.]
SGI SPIDER Chip
SPIDER OPERATION
• The physical transmission layer for each port is based on a pair of Source Synchronous Drivers and Receivers (SSD and SSR), which transmit and receive 20 data bits and a data framing signal at 400 MBaud.
• The data link level guarantees reliable transmission using a CCITT-CRC code with a go-back-n sliding window protocol [1] retry mechanism, and is referred to as the Link Level Protocol (LLP).
• The message layer defines 4 virtual channels and a credit based flow control scheme to support arbitrary message lengths, as well as a header format to specify message destination, priority, and congestion control options.
• The receive buffers of a port maintain a separate linked list of messages for each of the 5 possible output ports for each virtual channel to avoid the ‘block at head of queue’ bottleneck.
SPIDER Crossbar Arbitration
• To maximize bandwidth through the crossbar without using unreasonable buffering, each virtual channel buffer is organized as a set of linked lists. There is one linked list for each possible output port for each virtual channel. This solution avoids the block at head of queue problem. To maximize crossbar efficiency, each virtual channel from each port can request arbitration for every possible destination. Each arbitration cycle, the arbiter chooses up to 6 winners from as many as 120 arbitration candidates to maximize crossbar utilization.
• Messages accumulate a network age as they are routed, increasing their priority to avoid starvation and promote network fairness. In order to avoid starvation and encourage network fairness, the arbiter is rotated each arbitration cycle to favor the highest priority requestor. Priority is based on the age field of a message header.
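The candidate count follows from the numbers given earlier: 6 ports x 4 virtual channels x 5 possible output ports = 120. The sketch below illustrates age-based selection only. It is not SPIDER's arbiter logic: it assumes a greedy oldest-first pass with at most one winner per input port and per output port, and it omits the rotating priority. The tuple layout and function name are made up.

```python
def arbitrate(candidates):
    """candidates: list of (age, in_port, vc, out_port). Returns the winners."""
    winners, used_in, used_out = [], set(), set()
    # Oldest message first; the sort is stable, so ties keep their list order.
    for cand in sorted(candidates, key=lambda c: -c[0]):
        age, in_port, vc, out_port = cand
        if in_port not in used_in and out_port not in used_out:
            winners.append(cand)
            used_in.add(in_port)
            used_out.add(out_port)
    return winners


cands = [(3, 0, 0, 1), (9, 2, 1, 1), (5, 0, 2, 4), (1, 3, 0, 2)]
print(arbitrate(cands))   # [(9, 2, 1, 1), (5, 0, 2, 4), (1, 3, 0, 2)]
print(6 * 4 * 5)          # 120 candidates at most
```

The age-3 message loses here because its input port is already used by the age-5 message from the same port, which shows why letting every virtual channel bid for every destination gives the arbiter more ways to fill the crossbar.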
Arbitration Contd.
• After data is received by the SSR and synchronized, it enters the chip core and begins several operations in parallel. Table lookup and crossbar arbitration are normally serialized, as the exit port must be known before arbitration begins.
• To parallelize these operations, table lookup is pipelined across SPIDER chips. While arbitration progresses, the table lookup is performed for the next SPIDER chip, which depends on the destination ID and the direction field. This does increase table size, as a full table is required for each neighboring SPIDER chip, but it reduces latency by a full clock. Pipelined tables also add flexibility to possible routes, as different exit ports can be given depending on where a message came from as well as where it is going.
Summary
• Routing Algorithms restrict the set of routes within the topology
– simple mechanism selects turn at each hop
– arithmetic, selection, lookup
• Deadlock-free if channel dependence graph is acyclic
– limit turns to eliminate dependences
– add separate channel resources to break dependences
– combination of topology, algorithm, and switch design
• Deterministic vs. adaptive routing
• Switch design issues
– input/output/pooled buffering, routing logic, selection logic
• Flow control
• Real networks are a ‘package’ of design choices
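The acyclicity condition for deadlock freedom can be tested mechanically. The sketch assumes the channel dependence graph is given as a dictionary from each channel to the channels a packet holding it may request next; the channel names and the ring example are illustrative.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a channel dependence graph."""
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            c = color.get(nxt, WHITE)
            if c == GRAY:             # back edge: dependence cycle found
                return True
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# Unidirectional 4-node ring: every channel may wait on the next one.
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
# Same ring with the wrap-around dependence removed (e.g. by a routing restriction).
broken = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": []}
print(has_cycle(ring), has_cycle(broken))   # True False
```

Limiting turns or adding separate channel resources both work by removing edges like c3 -> c0 from this graph.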