HPSR 2006
Distributed Crossbar Schedulers
Cyriel Minkenberg (1), Francois Abel (1), Enrico Schiattarella (2)
(1) IBM Research, Zurich Research Laboratory
(2) Dipartimento di Elettronica, Politecnico di Torino
OSMOSIS
HPSR 2006 © 2006 IBM Corporation
Outline
OSMOSIS overview
Challenges in the OSMOSIS scheduler design
Basics of crossbar scheduling
Distributed scheduler architecture
Problems
Solutions
Results
Implementation
OSMOSIS Overview
[Figure: OSMOSIS system diagram. 64 ingress adapters (VOQs, Tx, control) feed an all-optical switch that delivers cells to 64 egress adapters (EQ, control, 2 Rx). The switch consists of 8 broadcast units (WDM mux 8x1, optical amplifier, star coupler 1x128) and 128 select units (fast SOA 1x8 fiber selector gates, fast SOA 1x8 wavelength selector gates, combiner). Control links connect the adapters to the central scheduler, which runs a bipartite graph matching (BGM) algorithm. Numbered sequence: 1 packet waiting, 2 request, 3 central scheduler (BGM), 4a grant, 4b SOA switch command, 5 all-optical packet transfer.]
64 ports @ 40 Gb/s, 256-byte cells => 51.2 ns time slot
Broadcast-and-select architecture (crossbar)
Combination of wavelength- and space-division multiplexing
Fast switching based on SOAs
Electronic input and output adapters, electronic arbitration
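The 51.2 ns time slot follows directly from the cell size and the port rate. A minimal check in Python; the function name is illustrative and not taken from the slides:

```python
def time_slot_ns(cell_bytes: int, line_rate_gbps: float) -> float:
    """Time needed to transmit one cell, in ns (bits divided by Gb/s gives ns)."""
    return cell_bytes * 8 / line_rate_gbps


slot = time_slot_ns(256, 40)
print(f"time slot: {slot} ns")
# One matching is needed per time slot, so this is also the scheduling rate.
print(f"matchings per second: {1e9 / slot:,.0f}")
```

In other words, the scheduler has to deliver roughly 19.5 million matchings per second.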
Architectural Scheduler Challenges
Latency < 1 µs
Problem: Long permission latency (RTT + scheduling)
Solution: Speculation
Multicast support
Problem: Fair integration with unicast scheduling, control channel overhead
Solution: Independent schedulers with filter, merge & feedback scheme
Scheduling rate = cell rate
Problem: Produce one high-quality matching every 51.2 ns
Solution: Deeply pipelined matching with parallel sub-schedulers (FLPPR)
FPGA-only scheduler implementation
Problem: Does a 64-port scheduler fit in one FPGA device? If not, how do we distribute it over multiple devices while maintaining an acceptable level of performance?
Crossbar Scheduling: Bipartite Graph Matching
A crossbar is a non-blocking fabric that can transfer cells from any input to any output, subject to the following constraints:
At most one cell from any input
At most one cell to any output
Equivalent to Bipartite Graph Matching (BGM)
[Figure: a request matrix drawn as a bipartite graph between inputs and outputs, with three example matchings: maximal (size = 2), maximal (size = 3), and maximum (size = 4).]
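The difference between a maximal and a maximum matching can be made concrete with a small sketch. The 4x4 request pattern below is a hypothetical example (it is not the request matrix from the figure), and both function names are illustrative. The greedy pass yields a maximal matching, to which no further edge can be added; the augmenting-path search (Kuhn's algorithm) yields a maximum matching.

```python
def greedy_maximal_matching(requests):
    """Scan inputs in order; give each the first still-free output it requests."""
    taken, match = set(), {}
    for i, outputs in enumerate(requests):
        for o in outputs:
            if o not in taken:
                match[i] = o
                taken.add(o)
                break
    return match


def maximum_matching(requests, n_outputs):
    """Kuhn's augmenting-path algorithm for bipartite maximum matching."""
    owner = [None] * n_outputs  # owner[o] = input currently matched to output o

    def try_assign(i, seen):
        for o in requests[i]:
            if o in seen:
                continue
            seen.add(o)
            # Take a free output, or re-route the current owner elsewhere.
            if owner[o] is None or try_assign(owner[o], seen):
                owner[o] = i
                return True
        return False

    for i in range(len(requests)):
        try_assign(i, set())
    return {owner[o]: o for o in range(n_outputs) if owner[o] is not None}


# Hypothetical requests: requests[i] lists the outputs for which input i has cells.
requests = [[0, 1], [0], [2, 3], [2]]
print(len(greedy_maximal_matching(requests)))  # maximal matching
print(len(maximum_matching(requests, 4)))      # maximum matching
```

For this pattern the greedy pass stops at two edges while the maximum matching has four. Augmenting-path algorithms are generally considered too slow to run once per time slot, which is why practical schedulers settle for maximal matchings.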
Pointer-based Parallel Iterative Matching
One matching must be computed in every time slot, so we need fast and simple algorithms
A suitable class of algorithms is parallel, iterative, and based on round-robin pointers: i-SLIP (McKeown), DRRM (Chao)
These algorithms have a number of desirable features:
100% throughput under uniform i.i.d. traffic
Starvation-free: any VOQ is served within finite time under any traffic pattern
Iterative: sequential improvement of the matching by repeating steps
Amenable to fast hardware implementation; high degree of parallelism and symmetry
DRRM Operation
[Figure: four input selectors IS[1] to IS[4], holding the VOQ state, facing four output selectors OS[1] to OS[4].]
Step 0: Initially, all inputs and outputs are unmatched
Step 1: Each unmatched input requests the first unmatched output in round-robin order for which it has a packet, starting from pointer R[i]. R[i] ← (R[i] + 1) modulo N iff the request is granted in Step 2 of the first iteration
Step 2: Each output grants the first input in round-robin order that has requested it, starting from pointer G[o]. G[o] ← (G[o] + 1) modulo N
Iterate: Repeat Steps 1 and 2 until no more edges can be added or a fixed number of iterations has been completed
Key to good performance is pointer desynchronization
If all VOQs are non-empty, pointers eventually all point to different outputs
No conflicts: maximum performance
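The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the OSMOSIS implementation: the names `rr_pick` and `drrm_slot` are made up here, ports are numbered from 0, and the sketch assumes that G[o] advances only when output o issues a grant in the first iteration, which the slide does not state explicitly.

```python
def rr_pick(candidates, pointer, n):
    """First port in round-robin order, starting at `pointer`, that is a candidate."""
    for k in range(n):
        port = (pointer + k) % n
        if port in candidates:
            return port
    return None


def drrm_slot(voq, R, G, iterations=1):
    """Compute one matching. voq[i][o] > 0 means input i has cells for output o."""
    n = len(voq)
    match = {}             # input -> output
    matched_outputs = set()
    for it in range(iterations):
        # Step 1: each unmatched input requests one unmatched output.
        requests = {}      # output -> set of requesting inputs
        for i in range(n):
            if i in match:
                continue
            wanted = {o for o in range(n)
                      if voq[i][o] > 0 and o not in matched_outputs}
            o = rr_pick(wanted, R[i], n)
            if o is not None:
                requests.setdefault(o, set()).add(i)
        if not requests:
            break          # no more edges can be added
        # Step 2: each requested output grants one input.
        for o, inputs in requests.items():
            i = rr_pick(inputs, G[o], n)
            match[i] = o
            matched_outputs.add(o)
            if it == 0:    # pointers move only on first-iteration grants
                R[i] = (R[i] + 1) % n
                G[o] = (G[o] + 1) % n   # assumption: advance on grant only
    return match


# All VOQs permanently non-empty, one iteration per time slot.
N = 4
voq = [[1] * N for _ in range(N)]
R, G = [0] * N, [0] * N
sizes = [len(drrm_slot(voq, R, G)) for _ in range(5)]
print(sizes)
```

With all VOQs non-empty, the matching grows by one edge per time slot until the request pointers all point to different outputs, after which every slot yields a full matching.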
Distribution Issues
Problem: Scheduler does not fit in a single device due to area constraints
Quadratic complexity growth of priority encoders
Monolithic implementation (implicit temporal and spatial assumptions)
– All results are available before the next time slot (or iteration)
– All required information is available to all selectors
Distributed implementation breaks these assumptions
Main problem: an input selector issues a request at t0 and receives the result (granted or not) at t0 + RTT
– The input selector does not know the results of requests issued during the last RTT
– Selectors are only aware of local status info (e.g. matches made in previous iterations)
The time required for information to travel from the inputs to the outputs and back is called the round-trip time (RTT)
RTT in time slots = RTT / (cell duration)
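[Figure: timing diagram between input selectors IS[1] to IS[N] and output selectors OS[1] to OS[N] for RTT >> cell duration. A request travels to the output selectors (output selection and status update), and the grant travels back to the input selectors (input status update and selection); the full exchange takes one RTT.]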
Coping with Uncertainty (1)
Problem: Uncertainty in the algorithm’s status
The pointer-update mechanism breaks
– No desynchronization → throughput loss
Solution: Maintain a separate pointer set for each time slot in the RTT
Basic idea: No pointer is reused before the last result is available
– Each input (output) selector maintains RTT distinct request (grant) pointers, labeled R[t] and G[t], with t ∈ [0, RTT − 1] (RTT expressed in time slots)
– At time slot t, the input selectors use set R[t mod RTT] to generate requests; each request carries the ID of the pointer set used
– Output selectors generate grants using G[t] in response to requests from R[t]
Each pointer set is updated independently of the others, so they all desynchronize independently. Therefore, all the good features of DRRM are preserved
Pointer sets are only updated once every RTT, hence they take longer to desynchronize
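The pointer-set rotation can be illustrated with a single input selector. The sketch below is hypothetical: the class and function names are made up, ports are numbered from 0, every request is assumed to be granted, and the result of the request issued at slot t is assumed to arrive just before the request for slot t + RTT is generated.

```python
class InputSelector:
    """One input selector holding RTT request pointers, one per slot in the RTT."""

    def __init__(self, n_ports, rtt):
        self.n = n_ports
        self.rtt = rtt
        self.R = [0] * rtt  # R[s] is used at every slot t with t mod RTT == s

    def request(self, t, nonempty):
        """Return (pointer set ID, requested output) for time slot t."""
        s = t % self.rtt
        for k in range(self.n):
            o = (self.R[s] + k) % self.n
            if o in nonempty:
                return s, o
        return s, None

    def result(self, set_id, granted):
        """Apply a returning result to the pointer set that issued the request."""
        if granted:
            self.R[set_id] = (self.R[set_id] + 1) % self.n


def simulate(slots, n_ports=4, rtt=4):
    sel = InputSelector(n_ports, rtt)
    in_flight = {}                   # issue slot -> pointer set ID
    nonempty = set(range(n_ports))   # all VOQs of this input hold cells
    trace = []
    for t in range(slots):
        if t - rtt in in_flight:     # result from one RTT ago arrives first
            sel.result(in_flight.pop(t - rtt), granted=True)
        set_id, out = sel.request(t, nonempty)
        in_flight[t] = set_id
        trace.append((set_id, out))
    return trace


print(simulate(8))
```

Each pointer set advances only once per RTT: the first four requests all target output 0, and output 1 is requested only from slot 4 onward. This is the slower desynchronization noted above.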
Coping with Uncertainty (2)
Problem: Uncertainty in the algorithm’s status
The VOQ-state update mechanism breaks
– How many requests were successful?
– Excess requests may lead to “wasted” grants, leading to reduced performance
Solution: Maintain a pending request counter (PRC) for every VOQ
P(i,j) tracks the number of requests issued for VOQ(i,j) over the last RTT
– Increment when issuing a new request
– Decrement when the result arrives
Filter requests: if P(i,j) exceeds the number of unserved cells in VOQ(i,j), do not submit further requests
This massively reduces the number of wasted grants
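A pending request counter filter might look like the following sketch. The class and method names are illustrative. The sketch also assumes one particular reading of the filter rule: requesting stops as soon as P(i,j) equals the number of unserved cells, so that no more requests are in flight than there are cells to serve.

```python
class PendingRequestFilter:
    """Tracks P(i,j), the number of requests in flight for VOQ(i,j)."""

    def __init__(self, n_ports):
        self.pending = [[0] * n_ports for _ in range(n_ports)]

    def may_request(self, i, j, unserved_cells):
        # Assumed threshold: allow a request only while fewer requests are
        # pending than there are unserved cells in VOQ(i,j).
        return self.pending[i][j] < unserved_cells

    def on_request(self, i, j):
        self.pending[i][j] += 1   # increment when issuing a new request

    def on_result(self, i, j):
        self.pending[i][j] -= 1   # decrement when the result arrives


# VOQ(0,1) holds two cells; the input tries to request in five consecutive slots.
prc = PendingRequestFilter(n_ports=2)
issued = 0
for _ in range(5):
    if prc.may_request(0, 1, unserved_cells=2):
        prc.on_request(0, 1)
        issued += 1
print(issued)
```

Without the filter, all five attempts would go out and up to three grants could be wasted on a VOQ that has already been emptied.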
Multi-pointer Approach (RTT = 4)
[Figure: multi-pointer DRRM for RTT = 4. Each input selector IS[i] holds the VOQ state, the pending request counters, and a request pointer set R[t0;i], R[t1;i], R[t2;i], R[t3;i]. Each output selector OS[o] holds a grant pointer set G[t0;o], G[t1;o], G[t2;o], G[t3;o].]
Hardware cost
(RTT − 1) additional pointers at each input/output, each log2 N bits wide
N² pending request counters
N RTT-to-1 multiplexers
Selection logic is not duplicated
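To get a feel for these numbers, the cost terms can be evaluated for a given port count. This is only an illustration: the function name is made up, counting inputs and outputs together as 2N selectors is one reading of "at each input/output", and the counter width is not modeled because the slides do not give it.

```python
import math


def multi_pointer_cost(n_ports, rtt):
    """Extra state of the multi-pointer scheme relative to a single pointer set."""
    pointer_bits = math.ceil(math.log2(n_ports))
    extra_pointers = 2 * n_ports * (rtt - 1)  # N input plus N output selectors
    return {
        "extra_pointers": extra_pointers,
        "extra_pointer_bits": extra_pointers * pointer_bits,
        "pending_request_counters": n_ports ** 2,
    }


# 64 ports as in OSMOSIS, with RTT = 4 time slots as in the figure.
print(multi_pointer_cost(64, 4))
```

The pointer storage grows linearly with both N and RTT, while the number of pending request counters grows quadratically with N.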
Multiple Iterations
Additional uncertainty: Which inputs/outputs have been matched in previous iterations?
1. Inputs should not request outputs that are already taken: wasted requests
2. Outputs should not grant inputs that are already taken: violation of the one-to-one matching property
Because of issue 2 above, the output selectors must be aware of all grants made in previous iterations, including those made by other selectors → implement all output selectors in one device
Input selectors use a request flywheel pointer to create request diversity across multiple iterations
PRC filtering applies only to the first iteration → can lead to “premature” grants
Distributed Scheduler Architecture
[Figure: distributed scheduler architecture. Labels: VOQ state, input selectors IS[1] to IS[4], output selectors OS[1] to OS[4], allocators (on midplane), control channels, control channel interfaces (each on a separate card), switch command channels.]
Performance Characteristics (16 ports)
[Figure: four plots of latency [time slots] (logarithmic axis) versus throughput (0 to 1) under uniform Bernoulli traffic: RTT = 4, RTT = 10, RTT = 20, and RTT = 4 without PRCs. Each plot shows curves for 1, 2, 3, 4, 8, and 16 iterations and for the monolithic scheduler.]
Optical Switch Controller Module (OSCM)
Midplane (OSCB; prototype shown here) with 40 daughter boards (OSCI; top right). Board layout (bottom right)
Thank You!